Machine Learning Meets with Metal Organic Frameworks for Gas Storage and Separation.

Cigdem Altintas¹, Omer Faruk Altundal¹, Seda Keskin¹, Ramazan Yildirim².

Abstract

The acceleration in design of new metal organic frameworks (MOFs) has led scientists to focus on high-throughput computational screening (HTCS) methods to quickly assess the promises of these fascinating materials in various applications. HTCS studies provide a massive amount of structural property and performance data for MOFs, which need to be further analyzed. Recent implementation of machine learning (ML), which is another growing field in research, to HTCS of MOFs has been very fruitful not only for revealing the hidden structure-performance relationships of materials but also for understanding their performance trends in different applications, specifically for gas storage and separation. In this review, we highlight the current state of the art in ML-assisted computational screening of MOFs for gas storage and separation and address both the opportunities and challenges that are emerging in this new field by emphasizing how merging of ML and MOF simulations can be useful.

Keywords: Gas separation; Gas storage; High-throughput computational screening; Machine learning; Material design; Metal−organic frameworks; Modeling; Structure−performance relationships

Substances: Metal-Organic Frameworks

Year: 2021 PMID: 33914526 PMCID: PMC8154255 DOI: 10.1021/acs.jcim.1c00191

Source DB: PubMed Journal: J Chem Inf Model ISSN: 1549-9596 Impact factor: 4.956

Introduction

Metal organic frameworks (MOFs) have been named as one of “the top ten emerging technologies in chemistry” by the International Union of Pure and Applied Chemistry (IUPAC).[1] The large number of possible metal nodes and organic ligand variations for synthesis of MOFs leads to attractive physical and chemical features such as high thermal stabilities (as high as 500 °C), various porosities (0.3–0.9), large surface areas (>8000 m2 g–1), low densities (0.2 g cm–3), and wide range of pore sizes (3–100 Å).[2,3] The ability to tune physical and chemical properties of MOFs during or after synthesis by metal cation exchange,[4] attachment or insertion of functional groups,[5,6] allowed the researchers to generate various types of MOFs with desired properties for a specific application. MOFs have been studied for many different areas such as gas storage and separation,[7−10] bioimaging,[11] catalysis,[12,13] batteries,[14] supercapacitors,[15] and drug delivery.[16] Currently, there are 103 951 experimentally synthesized MOFs deposited into the Cambridge Structural Database (CSD),[17] and this number is continuously increasing due to the existence of theoretically infinite number of possible MOF structures. This very large number of MOFs is a great advantage for the potential applications of materials but, on the other hand, it is not practical to experimentally test the performances of all these MOFs even for a single application at the lab scale. A high-throughput computational screening (HTCS) approach enabled researchers to evaluate the performance of thousands of MOFs for a desired application in a time- and cost-efficient manner using molecular simulations.[18,19] Simulation results of large numbers of MOFs help to identify the most promising material candidates for desired application purposes. 
Results of these HTCS studies reveal an enormous amount of data about both the physical (e.g., pore size, surface area) and performance properties (e.g., selectivity, working capacity) of MOFs. The challenge here is that it is nearly impossible to see all the trends and relations in this large MOF data without additional computational resources. Machine learning (ML) aims to develop computer programs capable of learning from large data sets using statistics and some algorithms; it may help to identify the hidden patterns and relations in data or to construct models to correlate some input variables (descriptors) with the output (performance) of the system.[20] ML has been applied in various fields such as chemistry,[21] chemical engineering,[22] catalysis,[23] and energy related materials.[24−27] In addition to the availability of new and more effective ML algorithms, developments in computational technologies including better data storage, management, and retrieval capabilities have contributed to the rise of ML. New experimental (like high-throughput experimentation tools and high-resolution spectroscopy)[28,29] and computational tools (classical and quantum mechanical methods)[30] provide a large amount of accurate data, which is essential for ML applications, and in the meantime, advances in supercomputing capabilities, high-throughput computational workflow managers,[31,32] and open source accessible software packages dedicated to ML[33−38] make the field much more accessible to newcomers. 
ML has been applied in materials science[39−41] for different purposes including the prediction of polymer properties,[42] classification of zeolite structures,[43] discovery of drugs,[44,45] identification of peptides as antibacterial agents,[46] design of homogeneous catalysts from different ligands,[47] or detection of biological effects of nanoparticles.[48] MOFs are a relatively new class of materials compared to polymers or zeolites, and ML has been recently applied to the field of MOFs to determine the best materials among the existing ones for a specific use, to discover new and better materials from practically endless possible structures, and to unravel the correlations in the data obtained from molecular simulations of MOFs. These data were used to build quantitative structure–property relations (QSPRs),[49] to predict the mechanical stability of structures,[50] to estimate gas storage capacity of MOFs,[51,52] and to design novel MOFs by ML.[53] Analysis of MOFs with ML has accelerated in recent years[25,54−73] for a variety of fields such as identifying electronic structure properties of MOFs,[63,64] predicting colors of MOFs,[61] defining the oxidation states of metals in MOFs,[62] assigning partial charges to MOF atoms,[59,73] optimizing the swing adsorption process conditions with MOFs,[58,60] and for predicting the performances of MOFs as sensors,[55,56] heat pumps,[57] and gas storage and separation materials.[66,74−79] In this review, we first provide a brief introduction and background to the MOFs and ML methods and then address the current state of the art in ML applications of MOFs for gas storage and separation, which are the most studied application areas of MOFs to date, while the focus of most of the ML studies on MOFs is also on these two areas. We then address both the opportunities and the challenges in the intersection of two fast-growing fields and present a detailed perspective for the future of ML-assisted computational MOF studies.

Background

As the future is shaped by the past, focusing on the past trends in the fields of ML and MOFs may guide us to make projections for the possible future developments on the subject. The timeline for studies on ML, MOFs, and ML applications on MOFs is given in Figure 1. It is hard to describe the detailed timeline for ML studies because most of the popular and relatively old ML techniques such as linear regression (LR), artificial neural networks (ANN), decision tree (DT), and support vector machine (SVM) have been used in the literature without directly referring to the ML concept in early applications. Although the first appearance of the ML concept in the Web of Science database[80] goes back to the 1950s, the frequency of appearance is limited to the order of tens per decade until the 1990s; the acceleration started in the 1990s. The revolutionary development in computational technologies, which coincided with the spread of the Internet and the data sharing capabilities it brought, has not only increased the use of ML but also shifted the focus to large data sets in recent years. If we exclude LR, which has wider (and older) application areas, ANN seems to be the most commonly implemented technique in chemistry since the 1980s; there are even review articles on the subject published in the 1990s. A paper dated 1991[81] introduced the basic concept and implementation of ANN and reviewed the chemistry related applications such as analysis of spectroscopic data, determining amino acid sequences and protein structures, and classifying atomic energy levels. Burns and Whitesides[82] (in 1993) reviewed the early application of ANN in chemistry. 
The first review covering the chemical engineering applications of ANN was published in 2000 focusing on the area of fault detection, prediction of polymer quality, data rectification, modeling and control.[83] Early applications of ML had been mostly based on small data sets generated in a single laboratory; although such works are still valuable, the attention started to shift to large data sets in the late 2000s, and this trend intensified in the 2010s with the increasing availability of large data sets.

Figure 1

Timeline for developments in the areas of MOFs, ML, and ML-assisted screening of MOFs. The number of publications in MOF and ML fields per year is given in the inset figures (retrieved from the Web of Science[80] in February 2021).

The pioneering works on MOFs in the 1990s[84−86] attracted the interest of many scientists from various fields to the potential of these materials. Due to this interest, the publishing rate on topics including “metal organic framework” has increased dramatically, reaching up to 7000 articles per year in 2019, and the number of experimentally synthesized MOFs exceeded 100 000.[17] Many of the early publications focused on the synthesis of new MOFs whereas studies after the 2000s were more directed to the applications.[87−89] With the increasing demand to evaluate the performances of MOFs in a time-efficient and accurate manner to guide the experimental efforts, the first molecular simulations were performed in 2003.[90] Grand canonical Monte Carlo (GCMC) simulations have been very widely used to compute gas adsorption in MOFs. Computational screening of MOFs considering ten or more structures started in 2007 with the aim of evaluating the accessible surface area of materials.[91] Establishment of the first hypothetical MOF database by Wilmer et al.[89] was the real kick-off for the screening of MOFs. Since then, many HTCS studies of MOFs have been carried out using that hypothetical MOF database[92] or in-house databases of experimental MOFs.[93] The computation-ready experimental MOF database[94] was the real catalyst for HTCS of thousands of experimental MOFs using GCMC simulations. Generation of different databases involving various MOFs allowed the use of ML in conjunction with HTCS for structure prediction and estimating properties of MOFs as well as their potentials for gas storage and gas separation.[31,33] The generalized implementation of ML in MOF studies is shown in Figure 2.

Figure 2

Generalized implementation of ML on MOF studies. The data source is selected and preprocessed, and the descriptors to correlate the data are determined. The data is then fed into the selected ML algorithm(s) to predict properties of MOFs. The results obtained from ML models are utilized for various applications, and analysis of the results helps to determine new and better descriptors for discovery of more accurate models.

ML Basics

ML is usually implemented through a framework that involves constructing a database, preprocessing the data, selecting descriptors, deciding on ML techniques, performing the analysis, and interpreting the results, even though the number and wording of these steps may change from source to source. One can find the details of such frameworks in various publications.[20,95−97] Here, we briefly discuss three key ingredients (techniques, descriptors, and data) from the MOF perspective, and we summarize the major issues to be considered in ML implementation.

ML Tasks and Techniques

ML techniques are developed to perform some specific tasks such as regression, clustering, classification, or association,[96] and these tasks are also used to categorize the ML techniques even though some techniques can be used to perform more than one task with some modifications. Regression involves the development of models correlating the input and output variables so that the output of a new input set can be determined. The predictive models, like linear (or nonlinear) regression, have been around since long before the ML concept itself, and many additional techniques such as least absolute shrinkage and selection operator (LASSO) regression, ridge regression, ANN, random forest (RF), and support vector regression have been developed over the years.[20,96−99] There are also variations of some techniques such as a recently popularized set of new ANN algorithms called deep learning.[100,101] Clustering refers to the task of grouping the data based on the similarities in input variables; it is usually used as a preanalysis tool before other ML techniques. Various techniques based on the centroid (like k-means clustering), connectivity (like hierarchical clustering), distribution, and density are used for clustering the data.[102] Classification, on the other hand, categorizes the data set into subsets based on the output variable; this way one can identify the values (or ranges) of input variables leading to those subsets, and determine the possible category of a new input set. The common classification techniques are decision trees, k-nearest neighbor algorithms, logistic regression, Bayesian classification, ANN, and SVM.[96] Finally, association aims to extract hidden relations among the variables (including output variables) in large data sets; it is used to identify the features (variables) that appear together in the same subsets.[103] The most important issue in selecting the ML technique is the task to be achieved. 
The characteristics of the data set (like size, homogeneity, type of variables, and so on) should also be considered.
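
To make the difference between the regression and classification tasks concrete, the following minimal Python sketch may help. It is only an illustration: the descriptor values, the uptake numbers, and the "low"/"high" labels are invented toy data, and the helper names `fit_line` and `knn_classify` are hypothetical rather than taken from any of the cited works.

```python
import math
from collections import Counter

def fit_line(x, y):
    """Regression task: ordinary least squares for y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

def knn_classify(train_X, train_y, query, k=3):
    """Classification task: majority vote among the k nearest training points."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Invented toy data: one descriptor (void fraction) and a made-up uptake value.
void_fraction = [0.3, 0.5, 0.7, 0.9]
uptake = [1.0, 2.0, 3.0, 4.0]
a, b = fit_line(void_fraction, uptake)
predicted = a + b * 0.6  # continuous output for a new input

# Invented toy data: (void fraction, pore size in angstrom) with made-up labels.
X = [(0.30, 4.0), (0.35, 5.0), (0.40, 4.5), (0.80, 12.0), (0.85, 14.0), (0.90, 13.0)]
labels = ["low", "low", "low", "high", "high", "high"]
label = knn_classify(X, labels, (0.82, 12.5))  # categorical output for a new input
```

The regression model returns a number for an unseen material, whereas the classifier returns a category. In a real study the descriptors would normally be scaled first, since the unscaled pore size dominates the Euclidean distance used here.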

Descriptors for MOFs

The descriptors, which are also called input variables, factors, features, or fingerprints, should describe the system well enough that differences in the output variables caused by changes in the descriptors can be explained with the desired accuracy. In other words, the selected descriptors should be strongly correlated to the output variable. In material research, including MOFs, the selection of the descriptor set can be quite complicated due to the presence of a large number and diversity of potential candidates. Ward et al.[104] identified 145 potential materials descriptors and classified them into four groups: stoichiometric attributes, elemental property statistics, electronic structure attributes, and ionic compound attributes. Other researchers categorize the descriptors as atomic/molecular and structural properties.[105,106] While the atomic/molecular descriptors are used for the intrinsic properties such as molecular weight, elemental composition, atomic/molecular radius, charge, and so on, the structural descriptors refer to bulk properties such as crystal structure, pore size, and surface properties. The final descriptor set will vary from model to model even for the same data set depending on the knowledge to be extracted. However, smaller ML models (i.e., those with fewer descriptors) are more robust and informative;[20] hence, the number of descriptors is usually reduced through dimensionality reduction. This may be done by either eliminating some less significant descriptors (feature selection) or creating a smaller number of new descriptors from the original list (feature extraction)[20] using a method like principal component analysis. 
Some new approaches have also been proposed, such as the ML neural-network representation of density functional theory (DFT) potential-energy surfaces to describe the energy and forces as a function of atomic positions[107] and the smooth overlap of atomic positions.[108] The atomistic structure-learning algorithm[109] and many-body tensor representation[110] were also suggested to create descriptors that represent the atomic/molecular systems better. Various descriptors, from the synthesis conditions (synthesis techniques and parameters, solvent types, microwave power, organic linkers, and metal precursors) to properties obtained from ab initio calculations (cohesive energies and force components) for accurate molecular simulations,[111] have also been used in MOF research. The classification of descriptors as elemental versus structural[59] or chemical versus structural[50,112] was also used, while Chong et al.[26] grouped them as geometrical, chemical, topological, and energy-based, as illustrated in Figure 3. We classified the descriptors as user defined (such as synthesis method, linker, metal, and so on) versus structural (like void fraction, pore diameters, and so on) in one of our works to draw attention to the idea that the structural properties are the results of user defined descriptors; hence, strong correlations, which should be avoided in some ML techniques, may exist between the two groups.[113] There are also newly created descriptors like atomic property weighted radial distribution functions (AP-RDF)[51] to predict the gas uptake more accurately.
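
The feature-selection route to dimensionality reduction mentioned above can be sketched with a simple correlation filter. This is one possible heuristic among many and is not the procedure of any cited study: the function name `select_descriptors`, the two thresholds, and the toy descriptor table are all assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def select_descriptors(columns, target, min_target_corr=0.5, max_mutual_corr=0.95):
    """Greedy filter: keep descriptors that correlate with the target and
    skip any descriptor that nearly duplicates one already kept."""
    ranked = sorted(columns, key=lambda name: abs(pearson(columns[name], target)), reverse=True)
    kept = []
    for name in ranked:
        if abs(pearson(columns[name], target)) < min_target_corr:
            continue  # weakly related to the output variable
        if any(abs(pearson(columns[name], columns[k])) > max_mutual_corr for k in kept):
            continue  # redundant with a descriptor that is already kept
        kept.append(name)
    return kept

# Invented toy table: three descriptors for five materials and a made-up uptake.
columns = {
    "void_fraction": [0.30, 0.45, 0.60, 0.75, 0.90],
    "pore_volume": [0.31, 0.44, 0.61, 0.76, 0.89],  # almost a copy of void_fraction
    "metal_index": [2.0, 1.0, 3.0, 1.0, 2.0],       # unrelated to the target
}
uptake = [1.0, 1.6, 2.1, 2.4, 3.1]
selected = select_descriptors(columns, uptake)
```

Here only `void_fraction` survives: `pore_volume` is dropped as redundant and `metal_index` as uninformative. A linear correlation filter cannot detect nonlinear or interaction effects, so it should be read as a first screening step rather than a complete selection method.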

Figure 3

Classification of descriptors used in MOF related ML studies. Examples of descriptors are given for each class. The topological representations are taken from the Reticular Chemistry Structure Resource (RCSR).[114]

Sources of Data

The most common approach in ML analysis of MOFs is to acquire the MOF related information from the experimental sources such as the computation-ready experimental (CoRE) MOF database[94,115,116] or computationally produced data such as hypothetical MOFs,[89,117−119] while the applications (such as gas uptake) are usually computed in house (such as with GCMC simulations). The most critical issue related to the MOF data is that HTCS[120−122] and ML[51,123] on MOFs generally focus on computation-ready databases; however, these databases consist of different MOFs with different properties, and the range of properties in one database may have huge impacts on the transferability of results or models to other MOF databases. Moosavi et al.[124] used ML to quantify the structural and chemical similarities of MOFs in different databases, to clarify if synthesizing a new MOF will be able to increase the variety of MOFs. They used CoRE MOF[116] and hypothetical MOF databases of different groups[89,117,125,126] for this purpose. The Henry’s constants for CO2 and CH4 and gas uptakes at various pressures were computed using GCMC simulations and results revealed that each MOF database has distinctly different distributions of properties. Furthermore, the importance analysis resulted in different ranking of the descriptors for gas adsorption if the ML models were constructed using different MOF databases. These results indicate the possibility of biases in ML analysis and question the transferability of models and results among the databases while it may also indicate the necessity of agreed upon standard protocols in generating, reporting, and storing the MOF data.

Critical Issues in ML Implementation

The most critical step in implementation of ML is descriptor selection, which may involve feature selection or feature extraction for choosing proper descriptors. If the selected descriptors are not predictive of the target variable and do not capture the right information, the accuracy of results will be poor regardless of the model. Model selection is another crucial step in implementation of ML, which involves the construction of various candidates, with various values of model hyperparameters (like number of hidden layers, number of nodes, and learning rate in ANN) and selecting the one that represents the data best. This is usually performed by constructing (training) candidate models using some portion of data with a screening strategy (like grid search) for the values of model hyperparameters within plausible ranges, and testing the models with the remaining (unused during model building) data; the performance of the candidate models is evaluated using some measures for fitness (like root-mean-square error) and the one that represents the data best is selected as the final model. The model selection step should be executed without data leakage, which refers to the transfer of information from the testing data into model construction (i.e., using the same information in training and testing). The simplest procedure, which is called holdout testing, divides the data into two subsets (for example 80% and 20%) randomly; the large subset is used to construct (train) the model while the small subset is used for testing. Two important mistakes are often made during this procedure. First, the training and testing sets may contain the same information; this usually happens when pairs of data points with very similar information are divided into training and testing resulting in overtraining and, therefore, poor generalization ability of the model. 
The second problem is that the researcher may not like the testing performance of the model and repeat the procedure by changing model parameters until the desired testing results are obtained; this also represents leakage because the training and testing are not independent anymore. K-fold cross validation may allow building of a feedback loop during model construction without leakage. This procedure also requires the separation of a testing set (say, 20%) before model construction. The remaining data is further divided into k subsets; k – 1 subsets are used for training while the remaining one is used for validation, in rotation, to tune the model hyperparameters. The model (i.e., the set of model hyperparameters) that resulted in the best average performance is selected as the final model and tested using the testing data separated in the first step. Unfortunately, the validation is sometimes treated as the testing and used for both model selection (determining the model hyperparameters) and testing the model fitness, again causing data leakage. Another possible cause of information leakage is the normalization/standardization of data before splitting into training and testing (carrying information related to the testing data to the model construction); to prevent this, the normalization/standardization should be applied to the training data set, and then the testing set should be normalized using the mean and variance of the training set. Preprocessing of data before the ML modeling can also be crucial for the success of analysis. To begin with, the missing values, duplications, or inconsistencies in the data, which are especially common in the data sets constructed from multiple sources, should be eliminated (inconsistencies among the MOF data, as reported by Moosavi et al.,[124] were briefly discussed in the previous section). 
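
The leakage-free workflow described above (separate a testing set first, tune hyperparameters by k-fold cross validation on the remainder, standardize with training statistics only, and touch the testing set once) can be sketched as follows. This is a minimal illustration under stated assumptions: the data are synthetic, the model is a toy k-nearest-neighbor regressor chosen only for brevity, and all function names are hypothetical.

```python
import math
import random

def rmse(y_true, y_pred):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def standardize(train_x, other_x):
    """Scale both sets with the mean and standard deviation of the training set only."""
    mean = sum(train_x) / len(train_x)
    std = math.sqrt(sum((v - mean) ** 2 for v in train_x) / len(train_x)) or 1.0
    return [(v - mean) / std for v in train_x], [(v - mean) / std for v in other_x]

def knn_predict(train_x, train_y, x, k):
    """Toy regressor: average the targets of the k nearest training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k

def cv_rmse(x, y, k_neighbors, n_folds=5):
    """Mean validation RMSE over n_folds rotations of the development data."""
    scores = []
    for fold in range(n_folds):
        val = set(range(fold, len(x), n_folds))
        tr_x = [x[i] for i in range(len(x)) if i not in val]
        tr_y = [y[i] for i in range(len(x)) if i not in val]
        va_x = [x[i] for i in sorted(val)]
        va_y = [y[i] for i in sorted(val)]
        tr_s, va_s = standardize(tr_x, va_x)  # statistics come from the training folds
        preds = [knn_predict(tr_s, tr_y, v, k_neighbors) for v in va_s]
        scores.append(rmse(va_y, preds))
    return sum(scores) / n_folds

# Synthetic data (an assumption): one descriptor and a noisy linear target.
rng = random.Random(0)
x_all = [rng.uniform(0.2, 0.9) for _ in range(100)]
y_all = [4.0 * v + rng.gauss(0.0, 0.1) for v in x_all]

# Step 1: set aside 20% as the testing set before any model building.
idx = list(range(100))
rng.shuffle(idx)
test_idx, dev_idx = idx[:20], idx[20:]
dev_x, dev_y = [x_all[i] for i in dev_idx], [y_all[i] for i in dev_idx]
test_x, test_y = [x_all[i] for i in test_idx], [y_all[i] for i in test_idx]

# Step 2: tune the hyperparameter with cross validation on the remaining 80%.
best_k = min([1, 3, 5, 9], key=lambda k: cv_rmse(dev_x, dev_y, k))

# Step 3: evaluate the selected model once on the untouched testing set.
dev_s, test_s = standardize(dev_x, test_x)
test_rmse = rmse(test_y, [knn_predict(dev_s, dev_y, v, best_k) for v in test_s])
```

The value reported as the testing error is `test_rmse`; the cross-validation scores are validation errors and are used only to choose `best_k`. A random split like this one does not protect against the first mistake described above, where near-duplicate structures end up on both sides of the split; that requires grouping similar materials before splitting.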
Various transformations on the descriptors (like normalization, standardization, or encoding) may also be needed, and the dimensionality of the data can be reduced for smaller and more robust models as discussed above. Additionally, some techniques may require more specific actions; for example, unbalanced data is a serious problem in DT classification; briefly, the number of data points in the classes should be approximately the same. Otherwise, the incorrectly classified instances from large classes are placed into the neighboring small classes, and this decreases their prediction accuracy because, even if the fraction of misplaced data is small in its source (large class), it may constitute a significant portion of data in the target (small class). If the use of equal size classes is not practical, the problem can be fixed by random sampling (duplicating some data points) to increase the size of the small classes approximately to the size of the largest.[127] Unbalanced data (i.e., a heavily tailed data set) may be challenging for the predictive models as well. The error should also be analyzed and reported properly to assess the true potential of ML models. Various measures such as root-mean-square error (RMSE), percent accuracy, correlation coefficient (R), coefficient of determination (R²), recall, and precision are used depending on the ML technique employed; the most common mistake on this issue is to report the validation error as the testing error as a result of the confusion discussed above. Additionally, a distinction should be made between two types of uncertainty. One type (also called aleatoric uncertainty) is associated with the data set; all experimental measurements (and most of the computational tools) have certain levels of uncertainty. This type of uncertainty cannot be reduced during ML application. 
The second type of uncertainty (also called epistemic uncertainty) arises from the lack of knowledge in practice even though the knowledge exists in theory; for example, the failure to identify an important descriptor or the lack of sufficient representation of certain effects in the data set may be considered in this category. This type of uncertainty may be reduced by designing better descriptors that are strongly related to the target variables, and it should be reduced as much as possible.[128,129]
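
The random sampling remedy for unbalanced classes mentioned earlier in this section (duplicating data points of the small classes) can be sketched as below. The function name `random_oversample`, the toy rows, and the class labels are assumptions made for illustration; the sketch simply draws duplicates with replacement until every class matches the largest one.

```python
import random
from collections import Counter, defaultdict

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of the smaller classes until every
    class has as many rows as the largest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        out_rows.extend(members + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels

# Invented toy training set: eight "low" materials and only two "high" ones.
rows = [(0.30 + 0.01 * i, 4.0 + 0.2 * i) for i in range(8)] + [(0.90, 13.0), (0.85, 12.0)]
labels = ["low"] * 8 + ["high"] * 2
balanced_rows, balanced_labels = random_oversample(rows, labels)
counts = Counter(balanced_labels)
```

Oversampling should be applied to the training portion only, after the testing set has been separated. Duplicating points before the split would place copies of the same material in both training and testing, which is the kind of leakage discussed above.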

Implementation of ML Algorithms for MOFs

ML has generally been used to predict the stability of MOFs, to unlock the gas storage and gas separation potentials of MOFs, and to design novel MOFs. Among various applications, the most mature application field of MOFs is gas storage and separation. ML studies on gas storage and separation performances of MOFs generally focus on structure–performance relationships to select the best descriptors and/or introduce new ones that can accurately predict the gas uptakes and selectivities of MOFs in a time-efficient manner. Here, we present some recent, representative studies in a concise and forward-looking overview.

Design and Discovery of New MOFs

ML has been used as a tool for predicting the structural properties of MOFs and generating well performing MOFs. For example, the SVM algorithm was used with molecular descriptors to define the most important properties that can lead to porous structures, and 481 porous, mechanically stable structures were identified among 156 333 CSD-derived experimental crystal structures.[130] The surface area descriptor provided an important insight for determining porous and stable structures prior to experimental studies. Also, 3385 hypothetical, computer-generated MOFs (containing 14 types of organic ligands, 28 different types of organic or metal-based nodes, and 41 topologies) were used to predict bulk and shear moduli from chemical–structural properties via combining HTCS with the ANN algorithm.[50] Topology and coordination number were identified as the most important factors for the mechanical stability of MOFs as they determined the energy cost for the changes in bond lengths and bond angles. 
Following the comprehensive investigation of bulk and shear moduli with respect to structural properties, a web-based tool was provided for researchers to examine structure–mechanical stability relations of MOFs.[131] Failed experiments, which are basically the synthesis conditions that did not lead to porous, crystalline MOFs, have also been considered as a source of data, as the parameters and conditions leading to failure should also be known.[132] In one such work, robotic MOF synthesis was combined with a genetic algorithm (GA), and chemical intuition obtained from failed experiments was quantified for the first time, to extend the knowledge on how the building blocks self-assemble into one of the most widely studied MOFs, HKUST-1, at different synthesis conditions.[133] The optimal conditions, such as the type of solvent used, reactants ratio, microwave power, and reaction time, as well as their relative importance for the synthesis of HKUST-1 with the largest surface area, were determined using the RF algorithm. The model predicted the synthesis outcome depending on the synthesis conditions with a testing error of 14%. This approach highlighted the importance of reporting the information obtained from failed experiments, and it may lead the way to the development of transferable synthesis methodologies that can be used to synthesize different MOFs with desired porosity and surface area without tedious repetition of experiments at many different conditions. Industrial application of MOFs mandates stability under humid and/or aqueous conditions, which is not straightforward to determine with experimental and computational methods. 
SVM, RF, and gradient boosting (GB) algorithms were utilized to screen MOFs for their water stability using three categories of descriptors: metal node, organic linker, and molar ratio of the number of organic linkers, oxygen, hydroxyl, and water groups to the number of metals in MOFs.[134] RF and SVM models were trained with the experimentally determined water stability data of 207 MOFs,[135] and accurately predicted the water stability of 10 MOFs when atomic radius and ionization potential of the metal ion and number of cyclic divalent nodes and six-membered rings were used as the descriptors. That work is a good example of the utilization of ML models to prioritize the experimental synthesis of stable MOFs. Recently popularized deep learning algorithms have also been utilized in MOF research. Combining Monte Carlo tree search with the recurrent neural network (RNN) algorithm, as shown in Figure 4a, is a useful approach to design new well-performing MOFs with high density of adsorption sites for CH4 and CO2 gases.[53] Performances of the designed MOFs were tested for CO2 and CH4 storage after they were hypothetically produced. Due to the design of pore space and organic linkers, the hypothetical MOFs were found to have higher deliverable CH4 capacity (the amount of gas that can be stored between adsorption and desorption pressures) and higher CO2 uptake than the existing MOFs having the same metal node and topology.

Figure 4

Design and discovery of new MOFs for gas storage. (a) Flowchart of tailor-made design of MOFs with machine learning for methane storage and carbon capture (Reproduced with permission from the work of Zhang et al.[53] Copyright 2020 American Chemical Society). (b–d) Workflow of the ML algorithm of Bucior et al.[52] using H2-MOF energy histograms as descriptors (Adapted with permission from ref (52). Copyright 2019 Royal Society of Chemistry).

Gas Storage Performances of MOFs

CH4, H2, and CO2 storage capacities of MOFs are in scope since the development of an appropriate adsorbent with optimum storage and delivery capacities for these gases requires knowledge of the factors defining the materials’ performance. Consequently, the works published mainly aimed to identify the significant chemical or structural properties or select the most suitable MOF structure. While the performances of various ML techniques are compared in some works, there are also reports that the ML techniques were used together in a complementary manner to enhance the knowledge extracted. Most ML studies have focused on the CH4 storage capacity of MOFs. For instance, CH4 adsorption data of 137 953 hypothetically designed MOFs were computed using GCMC simulations.[89] These MOFs were later used to develop QSPR models to reveal relations between CH4 storage capacities of MOFs and simple geometric descriptors such as pore size, surface area, and void fraction.[49] Using the SVM models, new hypotheses about combinations of material properties that might lead to very high CH4 storage capacities were proposed. Simple multivariable linear models were also successfully used to investigate the structural factors that determine CH4 uptake capacity of a small number of experimentally reported MOFs at various pressures.[112] According to these models, high gravimetric surface area, ∼6000 m2 g–1, and high porosity, ∼0.9, were required to reach the Department of Energy (DOE) target for CH4 uptake capacity (0.5 g g–1) at 35 bar. 
This analysis was extended to CH4 uptake data of 2224 experimentally synthesized MOFs using DT and ANN algorithms to determine the effect of user-defined descriptors (such as metal type, crystal structure, metallic percentage) and structural descriptors (such as pore volume, maximum pore diameter) on deliverable CH4 capacity.[113] An ML algorithm for accurately predicting the CH4 uptake capacities of 4763 CoRE MOFs and 69 840 hypothetical covalent organic frameworks (COFs) at 5.8 bar and 298 K was also proposed.[136] In this approach, only 100 materials are initially used in the training set to predict the top materials with the highest CH4 uptake capacities. The predicted top materials are then added to the training set, and a new prediction is made. This procedure is repeated until the training set includes the last predicted 100 top materials. The model was able to identify most of the top 100 materials in both the CoRE MOF and hypothetical COF databases using only a small fraction of the materials in each database. Selection of the most suitable ML tools may depend on the size and structure of the data; hence a significant number of researchers have compared the performances of various techniques and selected the one that performs best. In one such work, a topological barcode system,[137] including information about the pore geometry of materials, was combined with different ML algorithms, including RF, DT, kernel ridge regression (KRR), and SVM, to predict structural properties and deliverable CH4 capacities of zeolites and MOFs, which were previously computed with GCMC simulations.[138,139] Structural descriptors included void fraction, surface area, density, pore diameter, and information about interpenetration while chemical descriptors included atom type, degree of unsaturation, metallic percentage, oxygen to metal ratio, nitrogen to oxygen ratio, and electronegativity. 
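The iterative training-set strategy of ref 136 described earlier in this passage can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation of that work: the nearest-neighbour predictor, the single synthetic descriptor, the toy uptake function, the batch sizes, and the fixed number of rounds (instead of the stopping criterion quoted above) are all hypothetical choices made to keep the example self-contained.

```python
import random

def knn_predict(train, x, k=3):
    """Predict uptake at descriptor x as the mean of the k nearest training points."""
    nearest = sorted(train, key=lambda item: abs(item[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def iterative_screen(descriptors, evaluate, n_start=10, batch=10, rounds=5, seed=0):
    """Grow the training set with the materials predicted to perform best.

    descriptors: list of floats, one (hypothetical) descriptor per material
    evaluate:    callable standing in for an expensive GCMC simulation
    Returns the set of material indices that were actually evaluated.
    """
    rng = random.Random(seed)
    evaluated = set(rng.sample(range(len(descriptors)), n_start))
    for _ in range(rounds):
        # A real workflow would cache simulation results instead of recomputing.
        train = [(descriptors[i], evaluate(descriptors[i])) for i in evaluated]
        candidates = [i for i in range(len(descriptors)) if i not in evaluated]
        if not candidates:
            break
        # Rank unevaluated materials by predicted uptake, best first.
        candidates.sort(key=lambda i: knn_predict(train, descriptors[i]), reverse=True)
        evaluated.update(candidates[:batch])
    return evaluated

# Synthetic demonstration: the toy uptake peaks at a descriptor value of 0.7.
def toy_uptake(x):
    return 1.0 - (x - 0.7) ** 2

data_rng = random.Random(1)
descriptors = [data_rng.random() for _ in range(500)]
evaluated = iterative_screen(descriptors, toy_uptake)
true_top = sorted(range(500), key=lambda i: toy_uptake(descriptors[i]), reverse=True)[:10]
found = sum(1 for i in true_top if i in evaluated)
print(f"evaluated {len(evaluated)} of 500 materials, found {found} of the true top 10")
```

The point of the design is that the expensive evaluation is only run on materials the current model already ranks highly, so the training set becomes concentrated in the high-performing region.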
The RF model using both chemical and structural descriptors led to the most accurate CH4 uptake predictions. The use of the RF algorithm was extended to 69 839 hypothetical COFs to predict their deliverable CH4 capacities utilizing structural and chemical descriptors.[75] The combined use of chemical and structural descriptors increased the accuracy of ML predictions for COFs, especially at low pressure. Those works emphasized the need to consider chemical descriptors for accurate prediction of gas uptakes of nanoporous materials at low pressures. Similar works have been performed for other gases as well. For instance, the RF algorithm was utilized to predict the CO2, H2, and H2S adsorption capacities of 2932 CoRE MOFs using structural properties of MOFs along with a new descriptor, the probability that each of a set of different probe atoms is adsorbed by the material.[140] Three different probe atoms were modeled for adsorption probability calculations: Vprobes (neutral, nonpolar probes), Dprobes (neutral, small dipole moment), and Qprobes (small charge in the center). The number of MOFs used in the training set was varied from 50 to 1000 while the rest of the MOFs were used for the validation set. The model built from structural properties and the adsorption probabilities of Vprobes and Dprobes had the best R2 value for the adsorption capacity of each gas (0.92 for CO2, 0.94 for H2S, and 0.97 for H2). This work showed that the adsorption capacities of MOFs for different gases can be predicted by using the adsorption probability of pseudo atoms without the need for expensive simulations. 
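The R2 values quoted above compare ML-predicted uptakes with simulated reference values on a validation set. As a hedged illustration, the coefficient of determination can be computed as below; the function name and the four data points are hypothetical and are not taken from any of the cited studies.

```python
def r_squared(simulated, predicted):
    """Coefficient of determination between reference and ML-predicted values."""
    if len(simulated) != len(predicted) or len(simulated) < 2:
        raise ValueError("need two equally long sequences with at least two values")
    mean_sim = sum(simulated) / len(simulated)
    # Residual sum of squares: how far predictions are from the reference values.
    ss_res = sum((s - p) ** 2 for s, p in zip(simulated, predicted))
    # Total sum of squares: spread of the reference values around their mean.
    ss_tot = sum((s - mean_sim) ** 2 for s in simulated)
    return 1.0 - ss_res / ss_tot

# Illustrative numbers only (not data from any study).
gcmc = [1.0, 2.0, 3.0, 4.0]
ml = [1.1, 1.9, 3.2, 3.8]
print(round(r_squared(gcmc, ml), 3))
```

A value of 1 means the predictions reproduce the reference data exactly; the sketch does not guard against a constant reference set, for which R2 is undefined.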
Similarly, a large variety of ML techniques such as multilinear regression (MLR), DT, kNN, SVMs, ANNs, and RF was used to develop QSPR models based on binary classifiers that are built from void fraction, surface area, and pore size of MOFs to predict CO2 and N2 uptake capacities of hypothetical MOFs.[141] The model built with the RF algorithm had the highest accuracy (over 94%) for uptakes of both gases. Moreover, the model identified over 60% of high-performing materials for CO2 and N2 adsorption in a large and diverse set of ∼65 000 MOFs, setting a great example that ML techniques can lead to a significant time efficiency in screening large databases. In another work, H2 adsorption isotherms obtained from GCMC simulations were coupled with the NN algorithm to analyze the limits of H2 storage in >850 000 nanoporous materials including MOFs, COFs, ZIFs, PPNs, and zeolites.[142] Results revealed that H2 storage capacities of hypothetical materials could not exceed the performances of the experimentally synthesized materials reported to date. This was a motivating result for the research community working on adsorbent materials to change the common approach for designing better materials. With the aim of finding materials with optimum binding energy to reach high deliverable H2 capacities, the LASSO algorithm was used to develop a model which uses the guest–host energy histograms as descriptors instead of structural properties to predict deliverable H2 capacities between 100 and 2 bar, at 77 K.[52] The overall procedure of the analysis is given in Figure 4b while the H2 adsorption capacities obtained from GCMC versus ML for testing are given in Figure 4c, indicating that the model is quite successful. GCMC simulations were performed for H2 uptake of 137 953 hypothetical MOFs. 
The adsorbate–MOF potential energy landscape was sampled, and the interaction energy between the framework and an H2 probe was computed to construct the energy histogram creating the feature matrix for the hypothetical MOFs. The LASSO model was used to predict the adsorption energy from these features. The model was then used to predict the H2 uptake of 54 776 experimentally synthesized MOF structures in the CSD. Finally, the model was verified experimentally using the H2 capacity of MFU-4l, one of the top MOFs identified by the ML model for H2 storage capacity, and a good agreement was obtained as shown in Figure 4d. To define the extent of volumetric H2 storage capacity depending on adsorbent and operating conditions, Anderson et al.[143] used an ANN algorithm to predict H2 loadings of 105 hypothetical MOFs with 17 different topologies at multiple temperature and pressure conditions. Under nonisothermal conditions, changing the storage pressure from 100 to 35 bar did not significantly affect the deliverable H2 capacity of a top MOF while changing pressure under isothermal conditions led to a more significant drop in deliverable H2 capacity. That work demonstrated that it is possible to arrange the set of conditions for a material to acquire the target performance, and ML can provide a very fast means of arranging conditions. With the experience gained on application of ML algorithms for gas storage in MOFs, ML applications have been extended to different classes of nanoporous materials, different gases, and more effective descriptors and procedures. For instance, the ANN algorithm was used to develop a model to predict adsorption isotherms of 2400 topologically and chemically diverse MOFs for small, near-spherical, nonpolar, and mono- and diatomic adsorbates (alchemical species) at different pressures.[79] The model efficiently predicted Ar, CH4, Kr, Xe, N2, and C2H6 adsorption in MOFs. 
Thus, by combining the descriptors of both adsorbates and adsorbents in the same ML model for the first time, with the aim of making predictions for any new guest/host system without prior knowledge of the components, ML predictions for various gases were made without the need to provide any specific descriptors during the training of the model. New molecular descriptors, atomic property weighted radial distribution functions (AP-RDFs), capture geometric features of materials while utilizing tabulated atomic properties such as electronegativity. To build QSPR models for CO2, CH4, and N2 uptake of MOFs, the MLR and SVM algorithms were used with descriptors based on AP-RDF.[51] Approximately 58 000 unique hypothetical MOFs[89] out of ∼83 000 were used to train and calibrate the SVM and MLR algorithms while the remaining MOFs were used for testing. QSPR models developed with the SVM algorithm generally predicted simulated gas uptake with a higher accuracy than MLR. A comprehensive QSPR model that can distinguish high-performing MOFs (CO2 uptake > 1 mmol/g at 0.15 bar and > 4 mmol/g at 1 bar) from low-performing MOFs, depending on the CO2 uptake at conditions relevant to postcombustion capture (0.15 bar) and landfill gas purification (1 bar), was developed for 324 500 hypothetical MOFs.[144] CO2 uptake data calculated with GCMC simulations were used to train and validate the ML models. 10% of the database was used to train the SVM algorithm, again with AP-RDF descriptors, while the results of the developed QSPR model were validated using the rest of the database. The QSPR model successfully predicted 945 high-performing MOFs (CO2 uptake capacity > 1 mol/kg) out of the 1000 top MOFs identified with GCMC at 0.15 bar. 
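The classification criterion and the recovery figure quoted above (945 of the top 1000, i.e., 94.5%) can be expressed compactly in code. The following is a minimal sketch: the function names and the five example materials are hypothetical, and only the uptake thresholds are taken from the text.

```python
def is_high_performing(uptake_mmol_g, pressure_bar):
    """Label a MOF using the CO2 uptake thresholds quoted in the text."""
    thresholds = {0.15: 1.0, 1.0: 4.0}  # pressure (bar) -> minimum uptake (mmol/g)
    return uptake_mmol_g > thresholds[pressure_bar]

def top_recovery(simulated, predicted_labels, n_top):
    """Fraction of the n_top materials (ranked by simulated uptake)
    that the classifier flagged as high performing."""
    ranked = sorted(simulated, key=simulated.get, reverse=True)[:n_top]
    hits = sum(1 for name in ranked if predicted_labels.get(name, False))
    return hits / n_top

# Hypothetical simulated uptakes (mmol/g at 0.15 bar) and classifier outputs.
simulated = {"A": 2.4, "B": 0.3, "C": 1.6, "D": 1.1, "E": 0.8}
predicted_labels = {"A": True, "B": False, "C": True, "D": False, "E": True}
print(top_recovery(simulated, predicted_labels, n_top=3))
print(945 / 1000)  # recovery reported in the text for the top 1000 MOFs
```

Recovery of the top-ranked materials is a more demanding measure than overall accuracy, because a classifier can score well on a database dominated by low performers while still missing the best candidates.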
The work by Guda et al.,[145] which involves the use of ML to investigate the structural parameters of CO2 adsorption on a CPO-27-Ni MOF using its X-ray absorption near-edge structure (XANES) spectra, indicates that the applications may extend to more diverse areas of MOF research in the future.
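The energy-histogram descriptors used with the LASSO model earlier in this section[52] can be illustrated with a short sketch. It assumes that host–guest interaction energies have already been sampled on a grid; the bin edges, the clipping of out-of-range energies, the energy values, and the function names are hypothetical choices for illustration, and the regression step itself is omitted.

```python
def energy_histogram(energies, bin_edges):
    """Return the fraction of sampled grid points whose energy falls in each bin.

    energies:  host-guest interaction energies sampled on a grid (kJ/mol)
    bin_edges: increasing bin boundaries; values outside the range are
               clipped into the first or last bin
    """
    counts = [0] * (len(bin_edges) - 1)
    for e in energies:
        idx = 0
        # Advance to the last bin whose lower edge does not exceed e.
        while idx < len(counts) - 1 and e >= bin_edges[idx + 1]:
            idx += 1
        counts[idx] += 1
    total = len(energies)
    return [c / total for c in counts]

def deliverable_capacity(uptake_storage, uptake_discharge):
    """Deliverable capacity: uptake at storage pressure minus uptake at discharge pressure."""
    return uptake_storage - uptake_discharge

# Hypothetical energies (kJ/mol) for one material, 2 kJ/mol wide bins from -10 to 0.
sample = [-9.5, -7.2, -7.9, -3.1, -0.4, 1.8, -12.0, -5.5]
edges = [-10, -8, -6, -4, -2, 0]
features = energy_histogram(sample, edges)
print(features)  # one row of the feature matrix
```

Each material contributes one such row, and a sparse linear model can then relate the bin fractions to the deliverable capacity, here defined between the storage and discharge pressures (100 and 2 bar in the text).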

Gas Separation Performances of MOFs

MOFs have been widely investigated for gas separation because of their high porosities, chemical tunability, and diversity. However, screening large databases of MOFs to find well-performing materials is very time-consuming; hence ML techniques have been used in recent years to predict the gas separation performances of large numbers of MOFs and to identify the structural parameters for designing high-performing MOFs. One of the common ML applications in this area is the analysis of CO2 capture capabilities of MOFs. For example, the effect of pore chemistry and topology on CO2 capture performances of 400 hypothetical MOFs was studied by combining molecular simulations with ML models.[126] The computational data for building the ML models were generated by GCMC simulations for pure CO2 adsorption, CO2/H2, and CO2/N2 mixture adsorption by mimicking industrial conditions. To be able to focus on a particular region of the MOF structure-space, a set of MOFs encompassing all possible combinations of 16 topologies and 13 functionalized molecular building blocks was considered as shown in Figure 5a. Various ML algorithms such as MLR, SVM, DT, RF, and NN were used to build predictive models for CO2 capture metrics of 31 parent MOFs and their derivatives while the DT algorithm was used to predict the improvement or deterioration of CO2 capture performances of MOFs after functionalization. GCMC-computed versus ML-predicted CO2/N2 selectivity for testing is given in Figure 5b. The relative importance of descriptors changes significantly with the change of gas mixture composition as shown in Figure 5c. Results also revealed that functionalization of MOFs with thiol, cyano, amino, and nitro groups often improved CO2 capture performances of MOFs depending on their topology. That work elucidated the importance of pore chemistry for determining the gas separation potential of MOFs with simulations and ML models. 
In another work, the gradient boosted trees regression (GBTR) method was used to predict CO2 working capacity and CO2/H2 mixture selectivity of a topologically diverse set of hypothetical MOFs.[146] The data required to build the models was generated by GCMC simulations on the hypothetical MOFs for separation of CO2/H2 mixtures. Six different geometric descriptors and three chemical descriptors, which were constructed using AP-RDF, were used. The chemical descriptors were found to be more important for accurately predicting adsorption performance of MOFs for precombustion CO2 capture.
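The two target metrics mentioned above, mixture selectivity and working capacity, are not defined in this section. The sketch below uses their commonly used definitions; the function names, the uptake values, and the choice of adsorption and desorption conditions are hypothetical, and individual studies may apply different pressures or units.

```python
def adsorption_selectivity(q_i, q_j, y_i, y_j):
    """Selectivity of component i over j in a binary mixture.

    q_i, q_j: adsorbed amounts of the two components (same units)
    y_i, y_j: bulk gas-phase mole fractions of the two components
    """
    return (q_i / q_j) / (y_i / y_j)

def working_capacity(q_adsorption, q_desorption):
    """Uptake at the adsorption condition minus uptake at the desorption condition."""
    return q_adsorption - q_desorption

# Hypothetical uptakes (mol/kg) from a CO2/N2:15/85 mixture simulation.
s_co2_n2 = adsorption_selectivity(q_i=3.0, q_j=0.5, y_i=0.15, y_j=0.85)
delta_q = working_capacity(q_adsorption=3.0, q_desorption=0.8)
print(f"CO2/N2 selectivity: {s_co2_n2:.1f}, CO2 working capacity: {delta_q:.1f} mol/kg")
```

Because selectivity is normalized by the bulk composition, a material can show a high selectivity even when the dilute component has a modest absolute uptake, which is why both metrics are usually reported together.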

Figure 5

Effect of pore chemistry and topology on CO2 capture performances of MOFs. (a) Topological nets and molecular building blocks used to construct MOFs. (b) Comparison of different model predictions with GCMC results for CO2/N2:15/85 mixture selectivity of MOFs.[126] (c) Relative importance of descriptors obtained from GBM training for CO2/H2 and CO2/N2 separation performances of MOFs (Reproduced with permission from the work of Anderson et al.[126] Copyright 2018 American Chemical Society).

Separation performances of MOFs can be improved using different ML algorithms even before experimental synthesis. A deep generative model, supramolecular variational autoencoder (SmVAE), was used to automate the design of MOFs in a way that increases their performances for CO2/CH4 and CO2/N2 separation.[147] First, MOFs were deconstructed into their building blocks to obtain the collections of edges, vertices, and topologies following the methodology provided by Bucior et al.,[148] and then these were analyzed using descriptive statistical tools. Using the edges, vertices, and topologies, two million MOFs were generated, and the top MOF candidates, which were identified by GCMC simulations, had working capacities and selectivities higher than or similar to those of the best-performing MOFs reported in the literature. Capture of dilute CO2 from air could be a useful process to mitigate climate change. For that purpose, the potential of 6013 CoRE MOFs for CO2 capture from air was predicted using four different ML algorithms (back-propagation neural network (BPNN), DT, RF, and SVM) with descriptors such as pore sizes, volumetric surface area, and heat of adsorption (Qst).[72] To train the models, a maximum of 1000 MOFs were used while the remaining MOFs were used for the validation of the ML models. 
The ML model built with the RF algorithm was reported to have the best prediction performance for CO2 adsorption selectivity (R2 = 0.98) on the validation set, and the relative importance analysis revealed that Qst is the most important parameter for predicting adsorption selectivity for CO2. The excellent agreement of ML results with simulation results in that work highlights the potential of using ML models to achieve rapid and accurate screening of MOFs for CO2 capture even from dilute streams like air. ML techniques have also been applied to predict the gas separation performances of MOFs for other industrially relevant gas mixtures, such as Xe/Kr.[149] The RF algorithm was used to build the models considering six common structural descriptors as well as a newly developed descriptor, Voronoi energy of Xe, which is the average energy of an Xe atom at the Voronoi nodes of the accessible pore space. The data for training the models were generated by GCMC simulations for 15 000 randomly selected structures among 670 000 porous materials (MOFs, PPNs, ZIFs, COFs, and zeolites) acquired from various sources, and Xe/Kr selectivities of the remaining structures were predicted by the models. Molecular simulations were then performed for the 20 000 structures that the ML model deemed to be high performing (Xe/Kr selectivity >11) to obtain more accurate results from GCMC simulations. Many materials were predicted to have better separation performance for the Xe/Kr mixture than a leading material, CC3. However, there was no simple recipe for the geometric descriptors to guarantee that a material would be good for Xe/Kr separation. Defects in MOFs can provide different adsorption sites for gas molecules, or facilitate gas diffusion via increasing the pore size, and can change the separation performances of MOFs. 
The significance of defects for the ethane/ethylene separation performances of 425 hypothetically created MOF (UiO-66) structures with missing linker defects was also studied by GCMC simulations and ML algorithms.[150] Among six different ML models trained with structural and chemical descriptors, the LR model accurately predicted the working capacity, selectivity, and shear modulus (R2 > 0.98) while the RF model successfully predicted the bulk modulus (R2 = 0.92). Gravimetric surface area and pore volume were found to be the two most important descriptors for predicting the defect concentration of UiO-66 MOFs. The LR model generally led to good predictions while the kNN and SVM models provided overall good fits.
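The two-stage screening used in the Xe/Kr study discussed in this section[149] (simulate a random subset, train a model, predict the remainder, and re-simulate only the structures predicted to exceed a selectivity threshold) can be sketched as follows. This is an illustrative outline rather than the workflow of that study: the nearest-neighbour surrogate, the single descriptor, the toy simulation function, and all names are assumptions, and only the threshold of 11 is taken from the text.

```python
import random

def fit_nearest(train):
    """Hypothetical surrogate model: return the value of the closest training descriptor."""
    def predict(x):
        return min(train, key=lambda pair: abs(pair[0] - x))[1]
    return predict

def two_stage_screen(materials, simulate, fit, n_train, threshold, seed=0):
    """Train on a random subset, predict the rest, re-simulate predicted top performers.

    materials: dict mapping material name -> descriptor
    simulate:  callable(descriptor) -> selectivity, a stand-in for GCMC
    fit:       callable(list of (descriptor, selectivity)) -> predictor callable
    """
    rng = random.Random(seed)
    names = sorted(materials)
    train_names = set(rng.sample(names, n_train))
    train = [(materials[n], simulate(materials[n])) for n in train_names]
    predict = fit(train)
    # First stage: cheap ML predictions for every structure not simulated yet.
    shortlisted = [n for n in names
                   if n not in train_names and predict(materials[n]) > threshold]
    # Second stage: run the expensive simulation only on the shortlist.
    return {n: simulate(materials[n]) for n in shortlisted}

# Synthetic demonstration with a toy "simulation" that is linear in the descriptor.
def toy_gcmc(d):
    return 20.0 * d

rng = random.Random(42)
materials = {f"mat{i:03d}": rng.random() for i in range(200)}
confirmed = two_stage_screen(materials, toy_gcmc, fit_nearest, n_train=20, threshold=11.0)
print(len(confirmed), "structures re-simulated in the second stage")
```

The second simulation pass matters because the surrogate only ranks candidates; the values reported for the shortlisted structures come from the simulation, not from the ML prediction.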

Outlook

In light of the astonishing developments in experimental synthesis techniques, computational power, computation-ready material databases, and data sharing capabilities observed in the last two decades, we can expect that the use of ML techniques in MOF research, as in any field of science, will increase simply because much more effective algorithms and computational tools will be available in the future. The availability and quality of data will grow as a result of continuously improved experimental and computational research tools. Recent trends such as open-access publications, as well as increasing data sharing options such as data repositories and new databases, will also contribute to data availability while the existing databases can be expected to transform into more ML-friendly formats (i.e., more suitable for automated data extraction). Additionally, more established databases, similar to those in materials research such as Materials Project,[41] OQMD (Open Quantum Materials Database),[151] AFLOWLIB (Automatic Flow),[152] Computational Materials Repository,[153] and AiiDA (Automated Interactive Infrastructure and Database for Computational Science),[154] are likely to be developed for MOF-specific research, and some examples are already emerging: the MOFDB (MOF database of the Snurr Research group),[155] the Quantum MOF database,[58] and the Nanoporous Materials Adsorption Energy Database.[156] ML techniques usually require a large data set to reduce the uncertainty in the models developed or knowledge extracted, and creating such large data sets is difficult and costly; hence, new ML algorithms and approaches that can be used with small data sets are also being sought,[76,157,158] and this trend may be expected to grow in the future. 
Another approach that may be expected to spread further in the future is automated machine learning (AutoML), which aims to combine and automate the machine learning steps discussed above.[159,160] One of the earliest studies on large-scale screening of MOFs by ML methods already showed that automated predictive analytics pipelines can learn predictive models that generalize to new, unseen MOFs.[100] As more people, including nonexperts, are anticipated to become involved in ML with the increasing popularity of the subject and the availability of suitable data, automated ML tools are also likely to become more widely preferred. It is of the utmost importance to use the most up-to-date, accurate, consistent, and comprehensive data set in ML studies. We expect future experimental and computational studies on MOFs to share their methodology and results in complete detail in an ML-friendly format (available to use with different scripts or software) in addition to graphs, figures, and tables for reproducibility and comparability of data utilized in ML studies. It is also important to report the results of failed experiments and the performances of poorly performing materials to achieve a comprehensive range of structure–property relationships using ML. Increasing the number of MOFs will bring diversity in MOF properties such as chemical composition, topology, and pore sizes, and this will pave the way for new computation-ready experimental MOF databases. These MOF databases will enlarge the scope of HTCS studies to a wider range of MOF properties. However, MOFs reported in different computation-ready databases may not have the same range of structural properties,[124] and in particular, computational databases derived from experimental crystal structures do not contain only perfectly clean structures.[161−163] Therefore, it is important to consider the chemical diversity and consistency of MOFs in different libraries when assessing the results of ML studies. 
It would be useful for future ML studies to combine and examine MOFs from different databases, which would help to build generalized ML models for MOFs. The choice of informative, representative, and uncorrelated descriptors is important for ML studies on MOFs. We expect that future ML studies will consider and utilize various chemical descriptors widely. In terms of physical properties, topological descriptors provide valuable insights and can effectively be used for gas storage and separation studies. It is expected that ML studies for MOFs will continue to mostly use the results of HTCS studies in the future. Therefore, the assumptions and approaches used in the HTCS studies, such as selection of the charge assignment method and force field parameters, significantly affect the ML results. It would be very useful to utilize and compare different assumptions/approaches of HTCS to analyze their effects on the overall results of ML-assisted computational MOF studies. Defining force field parameters using ML, even for flexible MOFs, can also be a new research direction. ML-based methods can be used to predict molecular energies for small molecules with lower costs.[164] In the future, this might become possible for MOFs and increase the accuracy of their screening without requiring significant computational time. Workflows can even be designed to process large data sets of MOFs and identify the most promising MOFs for which further computational and experimental analysis can be made. Combination of MOFs with other materials such as polymers or ionic liquids also brings an additional area of research which requires investigation with computational modeling and ML to gain valuable insight about distinct properties of these new MOF-based composite systems. Results of ML studies depend on the accuracy of various factors. The validity of ML results in identifying the top-performing MOFs can only be determined by experimental testing. 
Thus, stronger communication between experimental studies and theoretical studies is needed. Future studies examining the synthesizability of hypothetical MOFs can significantly contribute to the experimental efforts and potentially accelerate the commercialization of MOFs if more studies focus on predicting mechanical properties of MOFs with descriptors such as organic linker and metal type, temperature, and pH of solvent pool before experimental synthesis. Boyd et al.[165] recently identified two MOFs among 325 000 hypothetical MOFs by utilizing the concept of the “adsorbaphore”, which describes the pore shape and the chemistry of the optimal binding site in a MOF, and proved their high CO2 selectivities from dry and wet flue gas with experimental methods. These topics can be further investigated by a collaboration of chemists, computer engineers, chemical engineers, and material scientists in many interdisciplinary studies while ML studies on extracting scientific knowledge from published articles also provide new insights for the synthesis and characterization of nanomaterials.[166,167] To conclude, ML-assisted HTCS of MOFs has great prospects for the future. MOFs have proven their potential for various applications including gas storage and separation, sensing, drug delivery, and catalysis, while the incorporation of ML into MOF research is still a fresh topic with many unexplored areas. In this review, we considered ML studies which focus on the design and discovery of MOFs with high gas storage and separation performances. However, incorporation of ML into MOF research is not limited to gas storage and separation applications. ML is now increasingly utilized to provide insights into the use of MOFs in other applications as well. 
The most promising working fluid–MOF combinations were investigated with ML, and the structure–performance relationships revealed the best combination of properties required for the design of an efficient adsorption-driven heat pump.[57] ML algorithms trained with MOF electronic noses to detect volatile organic oils can pave the way for the use of MOFs as sensors.[56] By helping to control the morphology and crystal growth of MOFs, ML can assist the synthesis of nanoscale MOFs with the desired morphology and properties.[168,169] The electrochemical properties of MOFs can be determined from the synthesis conditions using ML, which facilitates the design of MOFs for Li-ion batteries.[170] In addition to these, drug delivery performance, conductivity, catalytic performance, toxicity, and biocompatibility of MOFs are yet to be thoroughly researched, and utilization of ML can provide useful insights for these topics. Thus, we are excited about the future advancements in this field, which will guide many material scientists to discover new and fascinating aspects of MOFs.

Data and Software Availability

Since this is a review article, the authors do not report any data or software.
