
A Machine Learning Approach to Zeolite Synthesis Enabled by Automatic Literature Data Extraction.

Zach Jensen¹, Edward Kim¹, Soonhyoung Kwon¹, Terry Z H Gani¹, Yuriy Román-Leshkov¹, Manuel Moliner², Avelino Corma², Elsa Olivetti¹.

Abstract

Zeolites are porous, aluminosilicate materials with many industrial and "green" applications. Despite their industrial relevance, many aspects of zeolite synthesis remain poorly understood, requiring costly trial-and-error synthesis. In this paper, we create natural language processing techniques and text markup parsing tools to automatically extract synthesis information and trends from zeolite journal articles. We further engineer a data set of germanium-containing zeolites to test the accuracy of the extracted data and to discover potential opportunities for zeolites containing germanium. We also create a regression model for a zeolite's framework density from the synthesis conditions. This model has a cross-validated root mean squared error of 0.98 T/1000 Å³, and many of the model decision boundaries correspond to known synthesis heuristics in germanium-containing zeolites. We propose that this automatic data extraction can be applied to many different problems in zeolite synthesis and enable novel zeolite morphologies.


Year: 2019 PMID: 31139725 PMCID: PMC6535764 DOI: 10.1021/acscentsci.9b00193

Source DB: PubMed Journal: ACS Cent Sci ISSN: 2374-7943 Impact factor: 14.553

Introduction

Zeolites are microporous, crystalline aluminosilicate materials with a wide range of applications including catalysis, adsorption, separation, and ion exchange.[1,2] Beyond their use as Brønsted acid catalysts in the chemical and petroleum industries,[2−4] zeolites have been utilized for several important environmental improvement and renewable energy applications including biomass conversion, CO2 capture, NO abatement, and water purification.[5] Notably, the topochemical features (i.e., pore structure, framework type, and heteroatom composition) often determine the performance of the zeolite.[6,7] As such, recent efforts in the community have focused on developing rational design strategies to engineer zeolites for targeted applications, such as designing a pore geometry that mimics the transition state of the specific reaction or crystallizing a framework with structural chirality.[8,9] Zeolite crystallization often occurs through a hydrothermal synthesis pathway governed by a large synthesis parameter space and complex crystallization kinetics that yield metastable structures.[10] In a typical zeolite synthesis, sources of SiO2, Al2O3, and a mineralizing agent (e.g., a source of OH– or F– anions) are mixed with water to form an aluminosilicate gel. In addition, inorganic cations or organic structure directing agent (OSDA) molecules are added to direct the formation of the zeolite structure. This gel is aged, reacted, and then crystallized under hydrothermal conditions. The composition of the gel, traditionally parametrized using molar ratios (e.g., OSDA/Si or H2O/Si), and the synthesis conditions determine the outcome of the crystallization process. Because of these complexities, zeolite synthesis–structure relationships are difficult to understand. Several studies have advanced this understanding;[11−14] however, global methodologies for predicting new zeolite structures from synthesis parameters are still limited. 
As a result, the synthesis of novel zeolite structures requires a semiempirical process governed mostly by domain heuristics acquired through experience. The lack of predictive ability to design synthesis routes for zeolites is a major bottleneck for discovering new zeolite structures.[15,16] Using first-principles approaches, researchers have estimated that several million unique zeolite structures are energetically favorable.[17−19] However, currently only 245 zeolites have been synthesized,[20] and far fewer are commercially available.[21] This presents a particular opportunity considering that the global market for zeolite-driven commercial processes exceeds 2 million metric tons per year.[22] This massive gap existing between theoretical and synthetically confirmed structures (also important in crystallization more generally) demonstrates the need for new, cutting-edge approaches to zeolite synthesis.[23] Data-driven synthesis approaches have found success in a number of domains, including organic[24−27] and inorganic[28−30] materials synthesis. These approaches have the potential to accelerate the development of new materials, as experts can learn new, complex relationships from existing data resources using visualization and automated data mining algorithms, as well as build fast predictive models coupled to experimental validation.[31] Along with the need for significant volumes of data, a critical aspect of accurate, data-driven models is the inclusion of negative examples,[32,33] for example, synthesis routes that did not yield the desired product. Unlike many other materials science domains, the zeolite community often includes failed syntheses (i.e., amorphous and dense phases) in publications and data sets, thus making the data-driven study of zeolites very promising. 
Several zeolite studies have found success using data science to predict the zeolite framework type from crystallographic data[34,35] and modeling the mechanical properties of zeolites.[36] However, only a handful of reports exist that successfully model relationships between synthesis parameters and the resulting structure.[37,38] These studies relied on high-throughput synthesis methods to generate data used to model synthesis parameters. Even with synthesis methods designed for rapid sample generation, each generated less than 150 synthesis routes, thereby limiting the analysis to only a subset of zeolite structures. Global data-driven zeolite synthesis approaches will require large amounts of data. Given that the field of zeolite synthesis has been very active in both the academic and industrial communities for more than 60 years, one rich source of abundant data is directly from scientific journal articles and patents.[39] However, it is impractical to manually extract data from more than a few hundred publications.[32,40,41] Automatic data extraction from materials science and chemistry text using natural language processing (NLP) techniques greatly increases the amount of available data.[42] Several NLP tools and software pipelines have been developed for automatic data extraction from scientific journal articles.[42−44] These pipelines have been used to extract material property and synthesis information from several different material domains including Curie and Néel temperatures for magnetic materials,[45] synthesis conditions of titania,[39,46] and screening of potential novel perovskite materials.[47] Indeed, this automatic extraction can be applied to capture all published, available zeolite synthesis data into a single data set allowing global comparisons between all types of zeolite structures, but the necessary tools to do so have not been developed. 
In this paper, we present an automatic data extraction pipeline to study the crystallization of zeolite structures and suggest ways in which machine learning (ML) can be used to predict synthesis pathways for new zeolite structures. We create tools to automatically extract zeolite synthesis and topology data from multiple locations within a journal article, including tables, captions, and footnotes along with body text, thus greatly extending our previous method developed for metal oxides.[42] We demonstrate the accuracy and usefulness of our extracted data by examining trends in both a global zeolite set and a focused subset comprising germanium-containing zeolites. The latter data set is used to elucidate specific synthesis trends, where a random forest regression model allowed prediction of the framework density of synthesized zeolites. This model moves toward predicting new zeolite topologies from synthesis data.

Results and Discussion

Zeolite Data Extraction

From our database of 2.5 million journal articles, we filtered down to a set of 70 000 papers relevant to zeolite synthesis by text matching against specific zeolite keywords. The papers were processed through a pipeline that consists of extracting precursor information from the text of the paper with NLP algorithms (see Kim et al.[42] for additional information), applying HTML and XML parsing to the relevant synthesis tables, and using regular expressions (regex) to locate and extract compositional ratios. The extracted data were combined to reveal trends and train ML algorithms that could be used to gain insight into the effect of different synthesis variables. Figure 1 presents a schematic representation of this pipeline for extracting and combining zeolite synthesis data. The figure depicts our process to obtain information from multiple parts of a journal article (including text and table data) and use these data to inform zeolite synthesis through prediction of a structural property such as the framework density, defined as the number of T atoms (Si, Al, Ge, etc.) per 1000 Å³, which is one of the simplest metrics used to distinguish zeo-type porous materials.

Figure 1

Schematic overview of zeolite data engineering including (1) literature extraction from sources such as NLP from body text, parsing of HTML tables, and regex matching between text and tables, (2) regression modeling, and (3) zeolite structure prediction.

In a typical journal manuscript, zeolite synthesis information in the form of molar ratios and crystallization conditions is often scattered throughout tables, figures, and text within the main, supporting, and methods sections, each requiring a specialized extraction technique. Prior to our work, techniques capable of accurately extracting and correlating data from both tables and text in a journal article had not been developed. For tables, our software extracted information from HTML files, accounting for variation in both the HTML implementation and the design of the actual table. Tables were converted into data-mineable JSON files that are both human- and machine-readable. Next, we used our NLP pipeline to locate the target zeolite, type of OSDA, and missing crystallization conditions within the body text of the paper. Finally, the data were featurized into a fixed set of zeolite-relevant synthesis features (such as structural data from the International Zeolite Association (IZA) database) suitable for data analytics and ML (see Methods section for extraction and data engineering details). Gel composition is a critical variable in determining the resulting zeolite topology for a synthesis route.[48,49] Using our table extraction software, we extracted gel composition data from the synthesis tables found in our set of 70 000 zeolite papers to identify trends among the synthesis variables and the products of zeolite synthesis. Figure 2 shows these data plotted as pairwise relationships between several of the compositional features.

Figure 2

Pairwise plot of gel composition data automatically extracted from zeolite tables.

Zeolites are traditionally synthesized with theoretical molar ratios of Si/Al > 1, OSDA/Si < 1, H2O/Si < 100, and F/Si < 1. However, the data extracted by our pipeline represented in Figure 2 clearly show that these ranges can be exceeded. This effect is rationalized based on the specific conditions required for the synthesis of related zeotypes, such as silicoaluminophosphates (SAPOs) and Ge-rich silicogermanates.[50] The SAPO framework is formed by alternating tetrahedrally coordinated Al and P atoms, with a few of these heteroatoms isomorphically substituted by Si. Consequently, both SAPOs and Ge-rich molecular sieves have low Si contents, resulting in Si-normalized molar ratios beyond the classical values for typical high-silica zeolites. Although it is difficult to extract complex synthetic relationships and predictions from the simple compositional features shown in Figure 2, the data obtained from our pipeline can be used to validate general trends in zeolite synthesis. For example, a positive linear trend between the quantity of fluoride ions and OSDA molecules is observed (see bottom-left panel in Figure 2). Fluoride is used as a mineralizer in zeolite synthesis,[51] often resulting in zeolites with a lower concentration of defects by providing more negative species to counterbalance the positive charge of the OSDA cations.[52] These fluoride-based routes are often performed close to neutral pH values, and, consequently, researchers tend to add similar molar amounts of fluoride and OSDA cations to the synthesis,[53−55] which is reflected in the trend we see in Figure 2. Taken together, these data show that the automated extraction algorithms are capable of isolating compositional information from the literature in a reliable fashion, thus allowing us to perform more in-depth analyses of the zeolite synthesis space (vide infra).

Analysis of Germanium-Containing Zeolites

Germanium addition into zeolite framework sites is responsible for the synthesis of many new zeolite structures over the past two decades.[56] Motivated by this success, we constructed a germanium-containing zeolite data set with our automated extraction pipeline. These data enabled us to verify the accuracy of our extracted data against known trends between synthesis variables and structures by providing a concise data set in a zeolite subdomain with a large amount of heuristic synthesis knowledge developed by the community. Besides verification of the data extraction, we also identified potentially interesting areas within the germanium zeolite system that can be explored further with ML and experimental techniques. Using germanium keyword text matching, we condensed our zeolite data into a set of 238 papers discussing the impact of germanium on zeolite synthesis. Using our automated data extraction pipeline and manually adding data from the supplemental sections of these papers, we created a data set of 1638 unique synthesis routes, an excerpt of which is shown in Table 1. Of these, 1214 synthesis routes successfully result in the creation of a zeolite or germanate, while the remainder result in either a dense crystal or amorphous material. The data contained compositional variables (i.e., Si, Ge, Al, B, alkali cations, H2O, F–, and OSDA amounts), conditional variables (e.g., crystallization time and temperature), the type of OSDA used in the synthesis, and the products formed, all of which were extracted automatically and manually checked to ensure accuracy. The latter were featurized further with structural information extracted from the IZA Web site (e.g., framework density, secondary building units, and composite building blocks). Note that the OH–/Si molar ratio could be obtained by a simple postextraction data refining process.

Table 1

Excerpt of the Data Set of Germanium-Containing Zeolites^a

Si/Ge	Si/H₂O	Si/F^–	OSDA	product	reference
4	0.08	1.6	1,2-dimethyl-3-(3-methylbenzyl)imidazolium	CIT-13	(57)
30	0.19	1.9	hexamethonium	ITQ-13	(58)
2	0.67	2.7	benzyltriethylammonium	ITQ-44	(59)
1	0.1	1	1-methyl-3-(2′-methylbenzyl)imidazolium	NUD-2	(60)
7.5	0.13	1.76	pentamethyldiethylenetriamine	amorph	(61)

^a The full data set is available online (see Supporting Information).

Figure 3a shows the wide range of structural variability in Ge-containing zeolites with medium-, large-, and extra-large pore materials spanning framework densities from 7.5 to 19 T atoms/1000 Å³. Indeed, the inclusion of Ge, an element with a larger nonbonding radius than Si that is capable of forming smaller OTO angles, into the framework of silicates results in the stabilization of small-ring secondary building units (SBUs), including double four-membered rings (D4R), three-membered rings (3MR), and double three-membered rings (D3R).[16,62] The presence of these units gives rise to zeolite topologies with low tetrahedral site densities and large pores. While the use of Ge to stabilize small-ring SBUs is a known effect, the visual representation of all the data extracted with our pipeline gives rise to new insights and trends that were not previously clear. For example, extra-large pore structures are clustered in three areas corresponding to low, intermediate, and high framework densities (see purple triangles, yellow diamonds, and red squares, respectively, in Figure 3a). Further analysis revealed that materials with framework densities less than 10 T atoms/1000 Å³ correspond to pure nonzeolitic germanates (see germanates in Figure 3a), while materials with densities ranging between 11 and 14 T atoms/1000 Å³ correspond to topologies with some of the largest pores reported to date including ITQ-33 (18 MR × 12 MR × 12 MR)[16] and ITQ-44 (18 MR × 12 MR × 12 MR)[63] that have only been obtained with Si/Ge less than 4 (see ITQ-series in Figure 3a).
Lastly, extra-large pore materials with narrow framework densities ranging from 15.5 to 16.5 T atoms/1000 Å³ correspond to crystalline structures, including UTL and CTH, where Ge is placed within the D4R units spacing the siliceous layers (see Assembly-Disassembly-Organization-Reassembly (ADOR)-precursors in Figure 3a).[57,64] This feature has been exploited to access new topologies by disassembling the interlayer Ge–O bonds and reorganizing into a new structure (i.e., the ADOR method).[65,66]

Figure 3

Germanium-containing zeolite data extracted with our pipeline. (a) Framework density clusters corresponding to different classes of germanium-containing zeolites. (b) Trade-off between Ge content and the amount of F– ions required to stabilize different zeolites. The three-letter codes refer to specific zeolite framework structures defined by the IZA. ADOR is an interzeolite transformation synthesis method.[73]

Figure 3b depicts the close relationship between Ge and fluoride (F–) ion contents. The stabilization of small-ring SBUs requires either the presence of Ge as a heteroatom with smaller OTO angles or F– as a small structure-directing agent that fits within the SBU.[67] Our data clearly reveal that there is a trade-off between Ge content and the amount of F– ions required to stabilize a particular structure, in agreement with well-established synthesis tenets. Thus, zeolites containing large amounts of Ge can be synthesized with simple OSDAs and small amounts, or even in the absence, of F–, but these structures will not have high hydrothermal stability. For example, polymorph C of Beta (BEC) and IWR zeolites can be synthesized with Si/Ge ratios below 5 using simple OSDA molecules, such as tetraethylammonium or hexamethonium, under F– free conditions.[68,69] In contrast, synthesizing more hydrothermally stable zeolites with the same topology that have less Ge content always requires the use of F– ions (see Figure 3b), in combination with more specific OSDAs, such as large organic molecules synthesized via the Diels–Alder cycloaddition of bulky addends.[70,71] Importantly, visualization of the data obtained with our extraction tool provides new insights by identifying areas of interest for future study.
For example, Figure 3b reveals that there are several Ge-containing zeolites, including ITQ-22 (IWW, see Ge-IWW in Figure 3b), for which no OSDA has been discovered that crystallizes a Ge-free high-silica analogue.[72] We surmise that our data extraction tool combined with ML approaches will be essential to predict the required physicochemical properties to design such OSDAs. This is currently a main research topic in our laboratories.

Germanium Zeolite Framework Density Prediction

Finally, we combined our extracted data with ML algorithms to model the structural properties of a zeolite for a given set of synthesis parameters. While the previous examples verified our extracted data through simple trends, here we aimed to discover less intuitive, more complicated relationships between the synthesis parameters with the ultimate goal of potentially unearthing synthesis routes for new zeolite structures. Specifically, we modeled framework density as a regression problem using a random forest ensemble method (see Methods). In Figure 4a, we evaluated the fivefold cross-validation accuracy of the model, where the color hue corresponds to the frequency of data points. The root mean squared error (RMSE) is 0.98 T/1000 Å³, compared with a standard deviation of framework density in our data of 1.76 T/1000 Å³. The RMSE and the r-squared values indicate that our model begins to map synthesis conditions to the resulting structure’s framework density, allowing predictions of synthesis conditions for novel zeolites with both high and low framework densities.
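To put the two numbers above on a common scale, the RMSE can be compared with the spread of the target. The lines below are a rough sketch and not a value reported in the paper: they assume the RMSE is computed over the pooled out-of-fold predictions and that 1.76 is the population standard deviation σ of the same samples, in which case an implied coefficient of determination follows from simple arithmetic.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2},
\qquad
R^2 = 1 - \frac{\mathrm{RMSE}^2}{\sigma^2}
\approx 1 - \left(\frac{0.98}{1.76}\right)^2 \approx 0.69
```

Under these assumptions, the model would account for roughly two-thirds of the variance in framework density; a per-fold average of r-squared could differ somewhat.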

Figure 4

Random forest regression model predicting zeolite framework density from synthesis conditions. (a) Cross-validation results for the random forest model showing the actual experimental vs model predicted values for framework density. (b) A single decision tree regression model trained to predict framework density. Sample values correspond to the percentage of data passing through a node. Density refers to the average framework density value passing through each node. Vol SDA = the volume of the OSDA.

Besides the ability to accurately map synthesis conditions to a zeolite’s framework density, an additional benefit of using decision trees to model zeolite synthesis is human interpretability. In Figure 4b, we compared a single machine-learned decision tree regression model trained on the data to known synthesis pathways for zeolites with various framework densities. Following the different nodes of this decision tree, it is possible to predict the framework density of the zeolite likely to be obtained from the synthesis parameters employed (lower framework densities are ordered toward the left side of the tree). The first nodes capture the parameters with the greatest influence on the target variable (in this case, the framework density of the zeolite). As seen in Figure 4b, the Si/Ge molar ratio, the H2O/T molar ratio, and the volume of the OSDA, in this particular order, are the most decisive variables for predicting the framework densities of the Ge-containing zeolites.
As a simple validation, we note that most of the Ge-containing zeolites featuring a very low framework density reported in the open literature require Si/Ge molar ratios of 1–2, very concentrated gels with H2O/T less than 5, and bulky OSDA molecules, all parameters that are in good agreement with the variables and their corresponding values presented in our decision tree.[74,75] While some of these heuristics might be evident to an expert in the field of zeolite synthesis, this example represents the first instance of a machine-learned decision guideline for zeolites generated from automatically extracted literature synthesis data. The models in Figure 4 demonstrate the potential of ML for predicting zeolite structural information from synthesis parameters. While not directly related to catalytic performance, predicting framework density represents an important step in tailoring synthesis conditions for zeolites. Combined with models for ring geometry and active-site chemistry, we will continue to progress toward predicting the synthesis conditions required to make new zeolites tailored for specific applications and find the synthesis conditions necessary to yield hypothetical zeolite structures.

Conclusion

We have developed an automatic data extraction pipeline that locates, extracts, and formats zeolite synthesis data from tables, ratios, and text. This pipeline is applied to the synthesis of germanium-containing zeolites to study the complex relationships between the synthesis parameters and resulting topology. Beyond looking at existing trends, we have demonstrated a machine learning model that predicts an important structural descriptor of a zeolite’s topology from the synthesis conditions. This model represents an important step toward using data to predict synthetic pathways for plausible zeolite structures that have not been crystallized yet. With relatively small changes in data engineering, this pipeline can be applied to other research questions in zeolite chemistry. The prevalence of unsuccessful synthesis routes provides an opportunity to model the success of potential zeolite synthesis routes. Future directions could also include more complicated models to study OSDA design, more complicated structure representations for new zeolite topology synthesis, or synthesis parameter optimization using active learning.

Experimental Section

Data Extraction

Tables

Tables from HTML and XML files were converted into hierarchical JSON structures (see Supporting Information for examples). Rule-based approaches based on the placements of number entries in a table determined the correct position of the column and row headers and, by elimination, any header nesting within the table. All words in the row and column headers were classified, and the orientation of the table was determined by the frequency of materials versus properties within the two headers. The extractor then constructed the correct relationship for each cell in the table. We also extracted the table caption and table footers. Any references in the table were linked to the corresponding footer entry as a dictionary key. We extracted full tables from ACS, APS, Elsevier, Wiley, RSC, and Springer. We were only able to extract table captions from Nature and AAAS due to tables being embedded within the paper HTML as external links.
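The table-to-JSON step described above can be illustrated with a much simpler sketch than the authors' extractor. The code below is not their implementation: the `TableRows` class, the `table_to_records` helper, and the rule "the header is the leading block of rows with no numeric cell" are assumptions made for illustration, and header nesting, table orientation, captions, and footers are ignored. The sample rows reuse three columns of Table 1.

```python
import json
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g., whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def is_number(text):
    try:
        float(text)
        return True
    except ValueError:
        return False


def table_to_records(html):
    """Return one dict per data row, keyed by the column header."""
    parser = TableRows()
    parser.feed(html)
    rows = parser.rows
    # Assumption: header rows are the leading rows without any numeric cell.
    n_header = 0
    while n_header < len(rows) and not any(is_number(c) for c in rows[n_header]):
        n_header += 1
    header = rows[n_header - 1] if n_header else []
    return [dict(zip(header, row)) for row in rows[n_header:]]


html = """
<table>
  <tr><th>Si/Ge</th><th>OSDA</th><th>product</th></tr>
  <tr><td>4</td><td>1,2-dimethyl-3-(3-methylbenzyl)imidazolium</td><td>CIT-13</td></tr>
  <tr><td>30</td><td>hexamethonium</td><td>ITQ-13</td></tr>
</table>
"""
records = table_to_records(html)
print(json.dumps(records, indent=2))
```

Values are kept as strings here; converting them to numbers and attaching footer references would be separate steps.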

Ratios

We used regular expressions to search the zeolite paper text for compositional ratios. Once the ratio was located in the text, we determined the type of numeric value associated with each compositional element: either a number, range, or variable. If the element was associated with a number, we assumed every data point extracted from the paper had that value. If the element value was a range, we assumed the range described many experiments detailed elsewhere in the paper. If the element value was a variable, we combined all other elements with matching variables to construct algebraic expressions. These expressions were necessary for correctly normalizing compositional information.
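A minimal sketch of the number/range/variable classification follows. The regular expression, the `parse_ratio` and `classify` helpers, and the example gel string are all hypothetical; the paper does not give its actual patterns, and real text would also need to handle en dashes, Unicode subscripts, and other separators.

```python
import re

# Hypothetical pattern: a coefficient (range, number, or single-letter
# variable) followed by a species name such as SiO2 or OSDA.
TERM = re.compile(
    r"(?P<coef>\d*\.?\d+\s*-\s*\d*\.?\d+|\d*\.?\d+|[a-z])\s*"
    r"(?P<species>[A-Z][A-Za-z0-9]*)"
)


def classify(coef):
    """Label a coefficient as 'number', 'range', or 'variable'."""
    if re.fullmatch(r"\d*\.?\d+", coef):
        return "number"
    if "-" in coef:
        return "range"
    return "variable"


def parse_ratio(text):
    """Split a colon-separated gel composition into (species, coef, kind)."""
    terms = []
    for part in text.split(":"):
        match = TERM.fullmatch(part.strip())
        if match:
            coef = match.group("coef")
            terms.append((match.group("species"), coef, classify(coef)))
    return terms


# Illustrative string only, not taken from any paper in the data set.
ratio = "1 SiO2 : 0.5 GeO2 : 0.25 OSDA : 5-10 H2O : x HF"
terms = parse_ratio(ratio)
for species, coef, kind in terms:
    print(species, coef, kind)
```

Terms tagged `variable` would then be matched with other occurrences of the same symbol to build the algebraic expressions mentioned above.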

Text

Text information filled in gaps in synthesis conditions that existed after table extraction. We searched for crystallization operations by requiring both a time and a temperature condition while excluding many incorrect operations such as mix, dry, calcine, and stir. The conditions associated with the remaining operations were assumed to be the crystallization time and temperature for all data points associated with the syntheses extracted from the paper. We also searched the text for common OSDA names, again assuming the same OSDA applied to every synthesis.
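The filtering rule can be sketched as follows, assuming the NLP stage has already produced one record per operation. The dictionary keys (`verb`, `time_h`, `temp_c`) and the sample operations are hypothetical; only the rule itself (require both conditions, exclude the listed verbs) comes from the text.

```python
# Verbs named in the text as incorrect candidates for crystallization.
EXCLUDED_VERBS = {"mix", "dry", "calcine", "stir"}


def crystallization_steps(operations):
    """Keep operations with both a time and a temperature, minus excluded verbs."""
    return [
        op
        for op in operations
        if op.get("time_h") is not None
        and op.get("temp_c") is not None
        and op["verb"] not in EXCLUDED_VERBS
    ]


# Hypothetical operations extracted from one synthesis paragraph.
operations = [
    {"verb": "stir", "time_h": 2, "temp_c": 25},
    {"verb": "heat", "time_h": 168, "temp_c": 175},
    {"verb": "calcine", "time_h": 6, "temp_c": 550},
    {"verb": "filter", "time_h": None, "temp_c": None},
]
steps = crystallization_steps(operations)
print(steps)
```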

Data Engineering

Composition

For the Ge data, the compositional features are the molar amounts of Si, Ge, Al, B, alkali cations, H2O, F, and OSDA. Raw extracted values needed to be engineered from their representation in their respective tables into these standardized features. Other important compositional variables, such as the OH/Si molar ratio, can be obtained by simple postextraction data refining that considers the sources employed in the zeolite syntheses. Ratio values extracted from tables were split into the corresponding features. Next, we solved the algebraic expressions extracted from ratios and normalized all species with the condition that Si = 1, unless Si = 0, in which case Ge = 1.
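The normalization rule in the last sentence can be written as a short function. This is a sketch under stated assumptions: the `normalize` name, the dictionary representation of a gel, and the example amounts are illustrative and do not come from the paper's data set.

```python
def normalize(gel):
    """Scale molar amounts so that Si = 1, or Ge = 1 when the gel has no Si."""
    si = gel.get("Si", 0.0)
    reference = si if si != 0 else gel.get("Ge", 0.0)
    if reference == 0:
        raise ValueError("gel contains neither Si nor Ge")
    return {species: amount / reference for species, amount in gel.items()}


# Illustrative gels (made-up amounts): a silicogermanate and a pure germanate.
silicogermanate = normalize({"Si": 2.0, "Ge": 0.5, "H2O": 20.0, "F": 1.0})
germanate = normalize({"Si": 0.0, "Ge": 4.0, "H2O": 40.0})
print(silicogermanate)
print(germanate)
```

Switching the reference species for Si-free gels avoids division by zero, but it also means ratios for pure germanates are on a different basis than those for Si-containing gels.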

OSDA Featurization

All OSDAs were featurized using a multistep procedure starting with the conversion of the text form of each OSDA molecule into its SMILES representation using ChemSpider.[76] OSDA molecules represented by a non-IUPAC name or picture were manually assigned the correct IUPAC name and then converted to SMILES with ChemSpider. The Kier flexibility index and force field-optimized Cartesian coordinates were then obtained from a locally modified version of molSimplify.[77] Finally, ORCA 4.1[78] was used to calculate the volume, surface area, and dipole moment from the molSimplify-generated Cartesian coordinates. More details can be found in the Supporting Information.

Product Featurization

We featurized the products of the synthesis route with structural data from the IZA database. Zeolite materials were matched to the corresponding topology giving access to the framework density, ring configuration, and building units. Several nonzeolite germanate structures were also featurized with framework density and ring configuration provided by ITQ crystallographers.

Manual Data Supplementation and Cleaning

In addition to data extracted automatically, we manually extracted and engineered data from the supplementary sections of the Ge papers. These supplementary sections are highly unstructured PDF files, which prevents us from processing them with our automatic pipeline. After extraction and engineering, all the data were manually checked for inaccuracy, and any incorrect values were fixed.

Random Forest Model Architecture

We trained a random forest regression model using scikit-learn,[79] a machine learning Python library. The ensemble consisted of 100 decision trees with splits determined by mean squared error. We trained and cross validated the model on syntheses that resulted in a pure phase zeolite or germanate, which includes 898 synthesis routes. We also created support vector regression, simple neural network, and Gaussian process regression models to compare with the random forest model. The random forest model was chosen, as it exhibited the highest accuracy compared to the other models while also having the benefit of human interpretability.
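A setup of this kind can be sketched with scikit-learn as shown below. The data are synthetic placeholders generated on the spot, not the paper's 898 routes, and the fold shuffling, random seeds, and feature count are assumptions; only the 100 trees, the squared-error split criterion (the library default), and the fivefold cross validation follow the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(0)

# Synthetic stand-in for the featurized synthesis routes (NOT real data):
# 200 samples, 4 features, and a target that depends on two of them.
X = rng.uniform(size=(200, 4))
y = 10 + 5 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=0.3, size=200)

# 100 trees; the default split criterion minimizes mean squared error.
model = RandomForestRegressor(n_estimators=100, random_state=0)

# Fivefold cross validation: each sample is predicted by a model
# that never saw it during training.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
y_pred = cross_val_predict(model, X, y, cv=cv)

rmse = float(np.sqrt(np.mean((y - y_pred) ** 2)))
print(f"cross-validated RMSE: {rmse:.2f}")
print(f"standard deviation of target: {y.std():.2f}")
```

Comparing the cross-validated RMSE with the standard deviation of the target mirrors the comparison made in the Results section.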

Decision Tree Model Architecture

We trained a single decision tree regression model using scikit-learn.[79] Decision splits were determined by mean squared error. The model was trained on the 898 pure phase zeolite synthesis routes without cross validation, since we were only concerned with demonstrating machine learned synthesis intuition rather than any predictive ability with this model. The model was able to reproduce the framework density of the training data with an r-squared score of 0.97.
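The interpretability argument can be illustrated by printing the learned rules of a small tree. Everything below is a sketch: the data are synthetic, the three feature names are borrowed from the Figure 4 caption only as labels, and `max_depth=3` is an assumption chosen to keep the printout short (the paper does not state a depth).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
feature_names = ["Si/Ge", "H2O/T", "Vol SDA"]

# Synthetic placeholder data (NOT real syntheses): the target depends
# mostly on the first feature, so the tree should split on it first.
X = rng.uniform(size=(300, 3))
y = 10 + 6 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.2, size=300)

# Fit on all data without cross validation, as in the text; the default
# split criterion minimizes mean squared error.
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

rules = export_text(tree, feature_names=feature_names)
r2_train = tree.score(X, y)  # r-squared on the training data
print(rules)
print(f"training r-squared: {r2_train:.2f}")
```

Because the score is computed on the training data, it measures how well the tree reproduces what it has seen, not how well it predicts new syntheses.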

Safety Statement

No unexpected or unusually high safety hazards were encountered.
