Yongjin Lee1,2, Senja D Barthel1, Paweł Dłotko3, Seyed Mohamad Moosavi1, Kathryn Hess4, Berend Smit1,5. 1. Institut des Sciences et Ingéniere Chimiques, Valais , Ecole Polytechnique Fédérale de Lausanne (EPFL) , Rue de l'Industrie 17 , CH-1951 Sion , Switzerland. 2. School of Physical Science and Technology , ShanghaiTech University , Shanghai 201210 , China. 3. Department of Mathematics and Swansea Academy of Advanced Computing , Swansea University , Singleton Park, Swansea SA28PP , United Kingdom. 4. SV BMI UPHESS, Ecole Polytechnique Fédérale de Lausanne (EPFL) , CH-1015 Lausanne , Switzerland. 5. Department of Chemical and Biomolecular Engineering , University of California at Berkeley , Berkeley , California 94720 , United States.
Abstract
The materials genome initiative has led to the creation of a large (over a million) database of different classes of nanoporous materials. As the number of hypothetical materials that can, in principle, be experimentally synthesized is infinite, a bottleneck in the use of these databases for the discovery of novel materials is the lack of efficient computational tools to analyze them. Current approaches use brute-force molecular simulations to generate thermodynamic data needed to predict the performance of these materials in different applications, but this approach is limited to the analysis of tens of thousands of structures due to computational intractability. As such, it is conceivable and even likely that the best nanoporous materials for any given application have yet to be discovered both experimentally and theoretically. In this article, we seek a computational approach to tackle this issue by transitioning away from brute-force characterization to high-throughput screening methods based on big-data analysis, using the zeolite database as an example. For identifying and comparing zeolites, we used a topological data analysis-based descriptor (TD) recognizing pore shapes. For methane storage and carbon capture applications, our analyses seeking pairs of highly similar zeolites discovered good correlations between performance properties of a seed zeolite and the corresponding pair, which demonstrates the capability of TD to predict performance properties. It was also shown that when some top zeolites are known, TD can be used to detect other high-performing materials as their neighbors with high probability. Finally, we performed high-throughput screening of zeolites based on TD. For methane storage (or carbon capture) applications, the promising sets from our screenings contained high-percentages of top-performing zeolites: 45% (or 23%) of the top 1% zeolites in the entire set. This result shows that our screening approach using TD is highly efficient in finding high-performing materials. We expect that this approach could easily be extended to other applications by simply adjusting one parameter, the size of the target gas molecule.
The materials genome initiative has led to the creation of a large (over a million) database of different classes of nanoporous materials. As the number of hypothetical materials that can, in principle, be experimentally synthesized is infinite, a bottleneck in the use of these databases for the discovery of novel materials is the lack of efficient computational tools to analyze them. Current approaches use brute-force molecular simulations to generate thermodynamic data needed to predict the performance of these materials in different applications, but this approach is limited to the analysis of tens of thousands of structures due to computational intractability. As such, it is conceivable and even likely that the best nanoporous materials for any given application have yet to be discovered both experimentally and theoretically. In this article, we seek a computational approach to tackle this issue by transitioning away from brute-force characterization to high-throughput screening methods based on big-data analysis, using the zeolite database as an example. For identifying and comparing zeolites, we used a topological data analysis-based descriptor (TD) recognizing pore shapes. For methane storage and carbon capture applications, our analyses seeking pairs of highly similar zeolites discovered good correlations between performance properties of a seed zeolite and the corresponding pair, which demonstrates the capability of TD to predict performance properties. It was also shown that when some top zeolites are known, TD can be used to detect other high-performing materials as their neighbors with high probability. Finally, we performed high-throughput screening of zeolites based on TD. For methane storage (or carbon capture) applications, the promising sets from our screenings contained high-percentages of top-performing zeolites: 45% (or 23%) of the top 1% zeolites in the entire set. This result shows that our screening approach using TD is highly efficient in finding high-performing materials. We expect that this approach could easily be extended to other applications by simply adjusting one parameter, the size of the target gas molecule.
Zeolites, metal–organic frameworks,[1] and other related nanoporous materials[2] have many interesting applications, ranging from gas storage and
separations to catalysis, sensing, etc. Scientific interest in these
materials is related to their chemical tunability; by combining different
organic linkers and metal units, we could synthesize millions of different
nanoporous materials. These materials therefore provide an ideal platform
to develop a profound understanding of how to tailor-make a material
that is optimal for a given application. A practical limitation to
developing this understanding is that in reality one can synthesize
only a small fraction of all possible materials. Computational approaches
have therefore been developed to generate libraries of millions of
predicted nanoporous materials. To coordinate this development, the
White House launched in 2011 the Materials Genome Initiative,[3] which has generated significant scientific advances
in the field of computational materials discovery. Specifically, for
the development of advanced nanoporous materials, this initiative
has led to the creation of a large database (the so-called ‘Nanoporous
Materials Genome’) of different classes of porous materials
(an infinite number of materials, in principle) that could be synthesized
by combining different molecular building blocks.[4−11]The current computational approach uses screening based on
brute-force molecular simulations to generate the thermodynamic data
needed to predict the performance of these materials in applications
such as methane storage[12,13] and different types
of gas separation,[14,15] but this approach is limited
to tens of thousands of structures, due to computing time constraints.
The main disadvantage of these brute-force techniques is that they
can be relatively expensive. As the size of the libraries is growing
exponentially, alternative screening methodologies to screen these
databases are needed.One popular screening methodology is utilizing
simple descriptors that characterize materials. The idea behind these
descriptors is that materials with similar descriptors should perform
similarly. In the case of nanoporous materials, a fundamental question
in developing a descriptor is how to systematically characterize similarity
of pore structures. For nanoporous materials, popular descriptors
are, for example, pore volume, density of the material, surface area,
maximum included sphere, etc. These descriptors can be computed very
efficiently and can subsequently be used to correlate with the performance
of a material,[16,17] but unfortunately remain insufficient
to find the best materials.Recently, we developed a new descriptor
for nanoporous materials by taking a fundamentally different route
and exploring topological concepts to quantify similarity of pore
structures.[18] Describing the complete pore
topology of a material requires extremely high-dimensional data, which
exceeds the capacity of most conventional data-mining tools. Therefore,
in order to analyze high-dimensional data of pore structures, we employed
the topological data analysis (TDA),[19,20] which is a
newly developed data-mining tool that has been successfully used to
investigate various problems related to big data over the past decade.Topology is the branch of mathematics concerned with the global
structure of shape. TDA studies the “shape” of big and
high-dimensional data in order to discover meaningful structure in
the data and to identify important subgroups. Over the past decade,
TDA has been intensively applied to investigate various problems involving
large and high-dimensional data sets.[19] One remarkably successful application of TDA was the discovery of
a new type of breast cancer based on gene expression data of patients.
TDA enabled the investigators to identify a previously unknown subtype
of breast cancer with a unique mutational profile and excellent survival.[21] Moreover, recently, TDA has extended its range
of application to identification and characterization in materials
science.[22−24]In this article we aim to develop a high-throughput
screening approach for nanoporous materials genome using our TDA-based
descriptor. In our research, we chose zeolites as a starting material
for two important reasons: (1) We already possess over a 100,000 simulation
results for zeolites for several important applications, such as methane
storage[25] and carbon capture,[14] which can serve as a reliable reference set;
(2) as these materials are in the all-silica form, they all have the
same chemical composition and differ only in their pore topology.
This allows us to fundamentally check the validity of a pore-topology-based
descriptor.We recall first this new descriptor based on persistent
homology theory.[26] Next, we test the capability
of our new descriptor to predict performance properties of zeolites
for methane storage and carbon capture applications, in comparison
with predictions from conventional descriptors. Furthermore, we perform
high-throughput screening of zeolites for methane storage and carbon
capture applications and show that our screening approach is highly
efficient in finding high-performing materials.
Materials
and Theory
Zeolite Database
For constructing
the zeolite database in this study, we collected performance properties
of zeolites from available sources, and the corresponding structures
from the International Zeolite Association (IZA)[27] database and Deem’s hypothetical database.[7] For methane storage and carbon capture applications,
deliverable capacities of 139,407 zeolites from Simon et al.’s
work[25] and parasitic energies of 119,129
zeolites from Lin et al.’s study[14] were adopted, respectively. Deliverable capacity is a measure of
the energy density of the material, defined as the difference in loading
(number of methane molecules per unit material) at the (high) pressure
at which we charge the materials with methane and at the (low) pressure
at which we discharge the material. Parasitic energy is the total
loss of electricity production if a carbon capture-and-sequestration
process is added to a coal-fired power plant. For the best material,
the loss of electricity production is minimal.
Persistent
Homology
Persistent homology[26] is a sophisticated topological methodology for identifying important
features of a point cloud that persist over a range of spatial resolutions,
as opposed to noise which persists only through a limited range of
spatial resolutions. Persistent homology enables a multiscale analysis
geometric features of point clouds. From a point cloud, a filtration
of simplicial complexes is constructed, that is, a nested sequence
of geometric objects that are described by gluing points, line segments,
triangles, tetrahedra, etc., along their faces. Persistent homology
detects homological changes of the complexes as the filtration increases.
The persistent homology algorithm captures the birth and death times
of homology classes, where birth means the creation of a nonzero homology
class, while death refers to the merging of a homology class with
another class born earlier. Homology classes detect the following
types of features: zero-dimensional homology classes correspond to
connected components, one-dimensional homology detects circles, and
two-dimensional homology classes correspond to voids, and so on.[28] The lifetime of a class is the difference between
its death and birth times. Homology classes with relatively long lifetimes
provide important information about the global shape of the point
cloud, whereas noise generates short-lived homology classes.
Generation of a Descriptor Based on Pore Topology
The
procedures to generate a descriptor based on topology of pore structures
are illustrated in Figure . The first step is to prepare a finite set of points that
represents a pore structure. In order to identify pore structures
inside zeolites, we used the open-source software Zeo++,[16] which models the accessible void space inside
a porous material with a periodic Voronoi network. In Zeo++, both
the framework atoms and gas molecules are modeled as hard spheres
with radii adopted from the Cambridge Structural Database.[29,30] Pore structure is a continuous object consisting of an infinite
number of points. Thus, it is important to have sufficiently high
resolution in order to capture pore structures well with only a finite
number of points. However, it is also desirable that the resolution
not be too high, as computational cost increases with the resolution
(i.e., the number of points). One way to adjust the resolution is
to manipulate a minimum distance (rmin) between two different points sampled. Through careful investigation
from 1.2 to 0.7 Å, we set 0.8 Å as rmin in the work presented here.
Figure 1
Persistent homology:
the procedure to obtain a descriptor based on pore topologies in a
zeolite.
Persistent homology:
the procedure to obtain a descriptor based on pore topologies in a
zeolite.The second step is to capture
topological features of pore structures by performing persistent homology
analysis for the sets of points prepared in the first step and encoding
information about pore structure in a form of barcode. For persistent
homology analysis, we used the Perseus software.[31] The analysis was executed by constructing Vietoris–Rips
complexes up through dimension 3, increasing the persistence interval
ε by steps of 0.025 Å from the initial value of 0 for each
point. The maximum distance considered, εmax, was
set to 4.1 Å, in order to avoid identifying parts of the zeolite
as pores (see Supporting Figure S1). For
each zeolite, persistent homology analysis was performed separately
for each individual connected component of a pore, and the outputs
were combined thereafter. We proceeded in this manner because when
the smallest distance between neighboring pores is smaller than εmax, they become connected, and artificial pores are created
during the filtration process (see Figure S2).The output of persistent homology analyses is given in the
form of barcodes (or, equivalently, persistence diagrams), which play
the role of a descriptor (i.e., fingerprint) for identifying and comparing
zeolites. Although we generated barcodes just for two kinds of gas
(carbon dioxide and methane), it is worthwhile to note that these
procedures are universal. In developing barcodes for other gases,
only the size of the probe gas molecule needs to be adjusted to reflect
variations of accessible pore space, which makes this approach easily
applicable to various kinds of applications.
Measuring
Similarity between Different Barcodes
Comparing barcodes
for different structures requires a measure of similarity or dissimilarity
between them. There are a number of ways to define a measure on the
space of persistence diagrams. Among them, we used L2-
distances between the persistence landscapes; see ref (18). for details. L2-distances were calculated using the Persistence Landscape Toolbox,[32] after constructing persistence landscapes from
the barcodes we obtained. For each structure, we used barcodes in
dimensions 0, 1, and 2. Because each dimension matters for identifying
pore structures, we first calculated distances for each dimension
and combined them as a root-mean-square for the overall distance d between different barcodes
(or zeolites) i and j:where
α0, α1, and α2 are
weight factors for each dimensional distance, respectively. Λ1 and Λ2 are L2-distances calculated
at dimension 1 and 2. In dimension 0, instead of L2-distance,
we used the Euclidean distance , where n (or n) and V (or V) are the
number of sampled points and the volume of zeolite i (or j), which is the only relevant information
from the zero-dimensional barcode, as the lengths of the persistence
intervals of a 0-simplex (i.e., lifetime of connected components)
is an artifact of the sampling procedure rather than intrinsic to
the material. Because we have not made explicit how the overall distance
is related to the distances in each dimension, the effects of functional
forms’ type and different weight factors are discussed in Section .
Implementation Details
Unit-Cell Size Issue
In comparing shapes of different pore structures, it is reasonable
to compare pores obtained from zeolites having the same or almost
equivalent volume, in order to avoid possible errors due to volume
differences. However, as shown in Figure S3, the distribution of unit-cell volumes of zeolites covers a wide
range, from 290.855 to 42282.4 Å3 with an average
of 3483.4 Å3 and standard deviation of 1837.792 Å3. To minimize volume differences, in our analysis we used
supercells with periodic boundary conditions created by expanding
each unit cell repeatedly with a target volume of 40000 Å3, which is close to the largest volume of 42282.4 Å3 in the entire set of zeolites. Numbers of repetitions along
three axes were chosen to make expanded cells as isotropic as possible.
As shown in Figure S3, the distribution
of volumes of expanded cells became narrower, compared to that of
original unit cells, with an average of 40013.22 Å3 and standard deviation of 5873.805 Å3.
Correction of Death Time for Unclosed Second Dimensional Homology
Class
When generating barcodes using persistent homology
analysis, we set the maximum persistent interval εmax to be 4.1 Å, to avoid detecting parts of the zeolite as pores.
However, this εmax is not sufficiently large for
all homology classes to be dead at the end of the filtration, especially
for zeolites having large pores. If homology classes are still alive
at the end of the persistent homology analysis, barcodes corresponding
to them are not included in calculating L2-distances, although
they represent important topological features (such as large pores)
with long intervals. This might cause undesired errors in comparing
structures using barcodes. Figure S4 shows
scatter plots of performance parameters of the entire zeolite database
as a function of distances d for two example zeolites PCOD8330975 and PCOD8325951. Both
are high-performing zeolites for methane storage, having deliverable
capacity of 137.94 (PCOD8330975) or 97.6248 (PCOD8325951) v STP/v.
Thus, in our high-throughput screening (see Section for details), structures predicted to
be similar to PCOD8330975 or PCOD8325951 are categorized as promising
structures. A gauge distance determining whether two different structures
are similar or not normally occurs around d = 0.05. For PCOD8330975, as shown in Figure S4a, there are a reasonable number of
zeolites within the gauge distance, which is a standard distribution
of neighboring zeolites. As PCOD8330975 has relatively small pores
(the diameter of the largest included sphere D = 4.638 Å), for this material all second
dimensional homology classes are dead. In contrast, PCOD8325951 has
large D = 14.739 Å,
so that second dimensional homology classes corresponding to the largest
pore do not die by the time we reach εmax in our
persistent homology analysis. Because of this missing information
about large pores, Λ2 between PCOD8325951 and other
zeolites is estimated incorrectly to be small regardless of similarities
of pore shapes; as shown in Figure S4b,
there is a large population of zeolites within a distance 0.05 from
PCOD8325951. Most of them are categorized into two cases: (1) zeolites
with no pocket inside and (2) zeolites with a large pocket. If such
a structure is included in the initial training set for a screening
study, it causes many dissimilar structures to be incorrectly assigned
as neighbors and categorized as promising structures, which can lower
performance of screening significantly.Thus, to avoid this
unphysical neighboring and compare two barcodes correctly, a death
time has to be assigned to those homology classes that are remaining
at the end of the filtration. One possible way is increasing εmax until all second dimensional homology classes are trivial.
However, increasing εmax might lead to wrongly detecting
parts of the zeolite as pores, as explained in Section . Instead, we assigned a
death time for such second dimensional homology classes about pores
by an extrapolation approach based on the relation between D and death time for small
and midrange pores, because death time of second dimensional homology
classes is closely related to the size of pockets inside a pore structure.
As shown in Figure S5, we could obtain
linearly fitted behavior in a two-dimensional histogram of death time
against D for both CH4 and CO2. As shown in Figure S6, when the death time was assigned using the extrapolation
approach, dissimilar zeolites initially located close were shifted
right, and the histogram for PCOD8325951 took on a standard shape.
Dimensionality and Weight
In our study
measures of distance (similarity) are estimated in three dimensions.
These measures need to be combined in one overall measure of similarity,
to utilize important information about pore structures identified
at each dimension: connected components as zero-dimensional classes,
tunnels as one-dimensional classes, and voids as two-dimensional classes.
To determine the optimal way to define the overall distance, we prepared
a subset of zeolites by randomly selecting 5000 structures from the
entire zeolite database for the methane storage application and investigated
correlations between the performance parameter (PP) of each zeolite (Z) in the subset and the performance
parameter (PP) of the corresponding most similar structure (sZ) in the entire database. Four different
kinds of functional forms were examined with varying weight factors
in each dimension: arithmetic mean (AM), geometric mean (GM), harmonic
mean (HM), and root-mean-square (QM).Figure shows two-dimensional scatterplots of PPs
of zeolites in the subset against those of the corresponding most
similar ones. In the ideal case, as the structure sZ is most similar to structure Z, their performance parameters
should be very similar, that is, PP ≈ PP. Figure shows that indeed, irrespective of the types
of functional forms and the range of weight factors used, PP is similar to PP. The arithmetic (AM) and root-mean-square (QM) forms gave
the best results with weight factors of 0.1, 0.45, and 0.45 in dimensions
0, 1, and 2, respectively, with root-mean-square error (RMSE) = 6.64
for AM and 6.60 for QM. The RMSE was calculated as , where nsubset is the number of zeolites
in the subset. Based on these results, we used in the rest of this
article the QM with α0 = 0.1, α1 = 0.45, and α2 = 0.45 as a measure of distance
between different barcodes (or persistence diagrams). At this point
it is important to note that the optimal measure of distance may depend
on the performance property one is interested in.
Figure 2
Two-dimensional scatterplots
of the performance parameters PP of zeolites in a fixed subset of 5000 materials
against those PP of the corresponding most similar ones by
TDA, where AM, GM, HM, or QM denote arithmetic, geometric, harmonic,
or root-mean-square, respectively. The three numbers following a type
of mean are the chosen weight factors for dimensions 0, 1, and 2.
In these graphs the performance parameter is the one we use for methane
storage.
Two-dimensional scatterplots
of the performance parameters PP of zeolites in a fixed subset of 5000 materials
against those PP of the corresponding most similar ones by
TDA, where AM, GM, HM, or QM denote arithmetic, geometric, harmonic,
or root-mean-square, respectively. The three numbers following a type
of mean are the chosen weight factors for dimensions 0, 1, and 2.
In these graphs the performance parameter is the one we use for methane
storage.
Results
and Discussion
TDA-Based Description of
the Performance Parameters
Before applying a TDA-based descriptor
(hereinafter referred as “TD (topological data analysis-based
descriptor)”) to screening zeolites, we checked its capability
to predict performance properties. For each zeolite in the subsets
prepared by randomly selecting zeolites from the entire set, we found
the most similar zeolites in the entire set and compared performance
properties between them. For comparison, we also performed the analyses
for the subsets using each individual conventional descriptor and
an aggregation (CD) of five conventional descriptors as CD = {D, D, ρ, ASA, AV}, where D, and D represents the diameter of the largest included sphere and
of the free sphere, ρ is zeolite density, and ASA and AV denote
the accessible surface area and volume to a gas probe molecule. All
of these properties were calculated using Zeo++. For both individual
conventional descriptors and CD, distances between different structures
were measured with the normalized L2 Euclidean distance
between the vectors.Figure shows two-dimensional scatterplots of PPs of zeolites
in the subset of 5000 for the methane storage application. As shown
in Figure , selection
by individual conventional descriptors did not lead to good correlation
between PP and PP. The RMSE values (22, 24, 19, 20, and 18 for D, D, ρ, ASA, and AV, respectively) were significantly larger
than RMSE for TD. As one might expect, the aggregate of these descriptors
(CD) showed much improved correlation (with RMSE = 11.341), as in
the aggregate, there is a compensation effect due to combining information
about the pore structure contained in each individual descriptor. Figure does show, however,
that the overall performance of TD is significantly better than the
aggregate of CD (RMSE = 6.60 for TD and 11.34 for CD).
Figure 3
Two-dimensional scatterplots
of performance property PP of zeolites in the subset against those PP of the
corresponding most similar ones, for the methane storage application.
Red dots indicate results by TD. Green dots are results by conventional
descriptors. Blue diagonal lines correspond to {PP} = {PP}. The
RMSE = 6.60 by TD; 21.67, 23.56, 19.10, 19.53, 18.38, and 11.34 by D, D, ρ, ASA, AV, and CD, respectively.
Two-dimensional scatterplots
of performance property PP of zeolites in the subset against those PP of the
corresponding most similar ones, for the methane storage application.
Red dots indicate results by TD. Green dots are results by conventional
descriptors. Blue diagonal lines correspond to {PP} = {PP}. The
RMSE = 6.60 by TD; 21.67, 23.56, 19.10, 19.53, 18.38, and 11.34 by D, D, ρ, ASA, AV, and CD, respectively.It might be interesting to investigate in detail
some of the structures for which there are large discrepancies between
performance properties from CD and TD predictions. For instance, in Figure we compare the zeolites
PCOD8097838 and PCOD8004291 that were selected to be most similar
to PCOD8165978 by CD and TD, respectively. Globally, the pore shapes
for these three structures have similar one-dimensional linear shapes.
However, in detail, as opposed to the pore shape of PCOD8097838 (prediction
by CD), PCOD8004291 (prediction by TD) shows a zigzag patterned pore
shape similar to PCOD8165978. As shown in the table in Figure , although PCOD8165978 and
PCOD8097838 have very similar values for the five structural properties,
the CD might not capture the details in pore shape that could result
in significantly different performances between PCOD8165978 (PP =
26 v STP/v) and PCOD8097838 (PP = 93 v STP/v); note that PCOD8004291
exhibits PP = 28 v STP/v.
Figure 4
An example of the most similar structures selected
by TD (right) and CD (center) based on a seed zeolite (left). Red
or tan colored spheres represent oxygen or silicon atoms in a zeolite,
respectively. Blue colored spaces correspond to pore structures.
An example of the most similar structures selected
by TD (right) and CD (center) based on a seed zeolite (left). Red
or tan colored spheres represent oxygen or silicon atoms in a zeolite,
respectively. Blue colored spaces correspond to pore structures.We also applied our methodology
to the carbon capture application, where we used an inverse of parasitic
energy as a performance property. In Figure , the screening result using TD is compared
with screening by five different single descriptors and their aggregate
(CD). For the carbon capture application, compared to single descriptors
and CD, TD also yielded much improved correlation (RMSE = 1.87 ×
10–4 for TD with 2.75 × 10–4 for CD).
Figure 5
Two-dimensional histograms of performance property PP of zeolites in the subset
against those PP of the corresponding most similar ones, for
the carbon capture application. Red dots indicate results by TD. Green
dots are results by conventional descriptors. Blue diagonal lines
correspond to {PP} = {PP}. The RMSE = 1.87 × 10–4 by TD; 3.56 × 10–4, 3.43 × 10–4, 3.71 × 10–4, 3.73 × 10–4, 3.45 × 10–4, and 2.75 × 10–4 by D, D, ρ, ASA, AV, and CD, respectively.
Two-dimensional histograms of performance property PP of zeolites in the subset
against those PP of the corresponding most similar ones, for
the carbon capture application. Red dots indicate results by TD. Green
dots are results by conventional descriptors. Blue diagonal lines
correspond to {PP} = {PP}. The RMSE = 1.87 × 10–4 by TD; 3.56 × 10–4, 3.43 × 10–4, 3.71 × 10–4, 3.73 × 10–4, 3.45 × 10–4, and 2.75 × 10–4 by D, D, ρ, ASA, AV, and CD, respectively.
The Capability
of the TDA-Based Descriptor To Find the Top-Performing Zeolites
Next, we checked the capability of TD to detect high-performing
structures in the entire database, given that we know the structure
of several top-performing materials. The idea is that our method will
provide all the materials that are topologically similar to these
top-performing materials. If our hypothesis is correct, then most
of these similar materials should also be top-performing materials.For this analysis, we first defined the set of the top 100 materials
(“the best set”) out of the entire database, according
to their PP. For each structure in the best set, we found the five
materials that are closest to it, based on distances measured using
TD. The capability of a descriptor was measured as the probability
that the selected similar structures have PP larger than a threshold
value, which is set as the PP value of the top 1% of zeolites for
each application.For the methane storage application, we set
a threshold PP as deliverable capacity = 90 v STP/v, since the total
number of structures having deliverable capacity larger than 90 v
STP/v is about 1% of the entire database. As summarized in Table (a), TDs were highly
capable of detecting high-performing materials in the entire database,
as long as some top materials are already known. For instance, with
TD, it was possible to have another good material as the first nearest
neighbor with 79.3% probability, which is comparable to 82.7% by CD.
The average error for PPs between zeolites in the best set and the
corresponding five closest neighbors is 17.03% or 15.42% by TD and
CD, respectively. These results seem to state that CD performs better
than TD in terms of the probability of finding other top materials
as neighbors of given top zeolite. However, we would like to emphasize
that more top materials can be detected using TD overall. In the analysis
using CD, there were many overlaps among top materials found as neighbors
of different zeolites in the best set, which might indicate better
capability to detect more diverse top-performing zeolites by TD.
Table 1
Probability of Finding Top 1% Materials within The Nth Nearest Neighbors of Top 100 Zeolites for
(a) the Methane Storage and (b) the Carbon Capture Applicationsa
Nth
TD
CD
(a) Methane Storage Application
1
0.793 (69/87)
0.827 (72/87)
3
0.679 (163/240)
0.769 (157/204)
5
0.701 (265/378)
0.754 (236/313)
(b) Carbon Capture Application
1
0.170 (17/100)
0.222 (22/99)
3
0.138 (40/289)
0.197 (56/284)
5
0.129 (61/472)
0.160 (73/456)
Repeat appearances of the same zeolite are excluded.
Repeat appearances of the same zeolite are excluded.For the carbon capture application,
we set a threshold PP as 1/parasitic energy = 0.001282 kg CO2/kJ (corresponding to parasitic energy = 780 kJ/kg CO2) because about 1% of the entire set of structures has parasitic
energy lower than 780 kJ/kg CO2. As shown in Table (b), with TD, 17 top 1% structures
were detected as the first nearest neighbors, which corresponds to
17% probability. We could find 22 top 1% structures using CD. Compared
to the methane storage application, while the probability finding
another top material is lower, PPs between zeolites in the best set
and the corresponding five closest neighbors showed better agreement,
as the average errors were 12.83% and 12.16% for TD and CD, respectively.It is instructive to discuss why we think our predictions for carbon
capture are not as successful as for methane storage. The objective
function for methane storage is the deliverable capacity at a single
temperature. For carbon capture the parasitic heat is much more complex
as it compares the trade-off between compression and heating, which
therefore requires a prediction of not only the deliverable capacity
at different temperatures but also the heat of adsorption at different
temperatures. Based on these consideration, it is not surprising to
see that with the same level of detail in our fingerprint, one would
expect our method to work better for methane storage.The good
agreement between PP of structures from the out-of-bag search and
initial structure indicates that TD can reasonably predict PP without
performing molecular simulations for all structures.
High-Throughput Screening Using TDA-Based Descriptor
Next, we applied TD to high-throughput screening of zeolite database.
The workflow of our screening method is illustrated in Figure . It consists of the following
six steps. First, we performed persistent homology analysis for all
structures in the entire set and obtained barcodes that work as fingerprints
for pore topologies. Second, a training set was selected with the
min–max algorithm,[33] which is a
diversity selection approach to ensure that our training set of materials
sufficiently covers the entire space based on persistent homology.
The number of structures in the training set depends on how diverse
the database is in terms of pore topologies, as analyzed by persistent
homology. For each screening, we increased the number of structures
in the training set until the diversity of the set was sufficiently
saturated. The degree of diversity saturation was measured by the
change of minimum distances upon adding a new structure. The convergence
criterion was set as (1 – {minimum distance of a new structure}/{average
minimum distance of previous 10 structures}) < 0.001 (see Figure S7). In our work, as a training set, 1500
zeolites were chosen. Third, we ranked structures in the training
set according to their performance-related properties: deliverable
capacity for methane storage and inverse of parasitic energy for carbon
capture. Fourth, we performed a screening on the entire set of the
structures except those included in the training set. For each structure
in the training set, we created a bin containing similar structures,
as follows. For any material not in the training set, we computed
all pairwise similarities between it and the training set materials
and then assigned it to the bin of the material in the training set
to which it is most similar, based on the metric defined in the previous
section. We expected the PP of each material to be similar to the
PP of the material in the training set corresponding to its bin. Fifth,
after screening all materials not in the training set, we defined
the most promising set of materials (e.g., top 1%, top 0.5%, ...)
to be those materials that were assigned to the bins corresponding
to materials in the training set with PP larger than the criterium
we specified. Lastly, to verify our results, we compared the PP of
materials in the most promising set as obtained from the grand-canonical
Monte Carlo simulations to the hypothetical PP coming from their bin
assignment.
Figure 6
Procedure of high-throughput screening using the TDA-based descriptor.
Procedure of high-throughput screening using the TDA-based descriptor.First, we performed a high-throughput
screening using TD for methane application. For comparison, we also
performed a screening using aggregation of conventional descriptors
(CD). Figure a,b shows
the normalized distribution of diverse training sets and of promising
sets predicted by TD and CD for methane storage. For the sake of reference,
the distributions of the entire set and random training set are also
shown. Normalization was done with respect to the total number of
structures in each set. As shown in Figure a, the modes of the distribution of PP for
both diverse training sets are significantly shifted to higher PP
compared to mode for the random training set; note that the highest
peak occurs around PP = 80. The large population of zeolites in the
range of PP between 50 and 90 might reflect high diversity of zeolites
in that region.
Figure 7
(a) The normalized distribution of diverse training sets
by TD and CD, together with that of a random training set. (b) The
normalized distribution of promising sets by TD and CD, compared to
that of the entire set. The x-axis represents the
PP, which is deliverable capacity for the methane storage application.
(a) The normalized distribution of diverse training sets
by TD and CD, together with that of a random training set. (b) The
normalized distribution of promising sets by TD and CD, compared to
that of the entire set. The x-axis represents the
PP, which is deliverable capacity for the methane storage application.As shown in Figure b, our screening strategy efficiently detected
high-performing materials based on the comparison between the distribution
of PP in the promising set and that in the entire set. The distribution
of PP for the promising set is significantly shifted to high PP compared
to the entire set, confirming the efficiency of our screening strategy
using TD. Also, TD worked well for screening out low-performing materials
with PP less than about 40 v STP/v, which is important to ensure that
low-performing materials are not labeled as promising materials. Our
results show that TD and CD have similar modes of distribution of
PP for the corresponding promising set. While it seems that CD produces
more good structures in the promising set than TD, the picture is
somewhat different if we look at the percentage of top-performing
structures in the promising set as a fraction of the entire set, because
the normalized frequencies in Figure b show only the relative number of structures within
the promising set. Table (a) shows percentages of the number of structures having PP
> 90 v STP/v in the promising set based on the number of structures
having PP > 90 in the entire set. As summarized in Table (a), the promising set determined
by TD contained higher percentages of top-performing structures than
CD: 45.16% top 1% zeolites, which is significantly higher than 32.31%
by CD. Moreover, TD (respectively, CD) produced 61.1 (72.2), 72.2
(60.6), 59.8 (43.5), 55.6 (39.8), or 39.3 (27.2)% of structures having
PP > 130, 130 ≥ PP > 120, 120 ≥ PP > 110, 110
≥ PP > 100, or 100 ≥ PP > 90 v STP/v, respectively.
Table 2
Percentage of Top 1% Materials Detected in the Promising
Sets by TD and CD for (a) the Methane Storage and (b) the Carbon Capture
Applicationsa
(a) methane
store application
PP
TD
CD
>130
61.1%
72.2
130–120
72.2
60.6
120–110
59.8
43.5
110–100
55.6
39.8
100–90
39.3
27.2
total
45.16
32.31
The last row
in each table (i.e., total) shows overall percentage of top 1% materials
detected in the promising sets.
The last row
in each table (i.e., total) shows overall percentage of top 1% materials
detected in the promising sets.Next, the screening results for the carbon capture application
are shown in Figure . The overall results are similar to the case of methane storage;
the ordering of the modes of distribution of promising set is TD >
CD > random selection. However, it is worthwhile to note that the
discrepancy between TD and CD in ability to screen zeolites is larger
than that for methane storage. From Table (b), we observe that the promising set of
2105 (or 1839) structures created by TD (respectively, CD) contains
23.8 (15.4), 22.2 (10.9), 22.1 (10.8), 24.6 (4.5), or 21.7 (5.3) %
of structures having PE > 740, 750 ≥ PE > 740, 760 ≥
PE > 750, 770 ≥ PE > 760, or 780 ≥ PE > 770
kJ/kg CO2, respectively.
Figure 8
(a) The normalized distribution of a given
performance parameter for carbon capture for diverse training sets
created by TD and CD, together with that of a random training set.
(b) The normalized distribution a given performance parameter for
carbon capture for promising sets created by TD and CD, compared to
that of the entire set. The x-axis represents the
PP, which is deliverable capacity for the carbon capture application.
(a) The normalized distribution of a given
performance parameter for carbon capture for diverse training sets
created by TD and CD, together with that of a random training set.
(b) The normalized distribution a given performance parameter for
carbon capture for promising sets created by TD and CD, compared to
that of the entire set. The x-axis represents the
PP, which is deliverable capacity for the carbon capture application.
Conclusion
In this article, we developed a high-throughput approach for screening
zeolites, using a recently developed topological data analysis-based
descriptor (TD) that recognizes pore topology. For generating this
descriptor, a point-cloud representation of pore structures was created
using Zeo++, and topological features of the pore structures were
then encoded in the form of barcodes, by performing persistent homology
analysis for the point cloud. To build filtrations for persistent
homology, we used the Vietoris–Rips complex, but our method
could also be applied to cubical complexes and alpha complexes, as
we will do in forthcoming work.We first checked the capability
of this descriptor to predict performance properties of zeolites for
methane storage (deliverable capacity) and carbon capture (inverse
of parasitic energy), in comparison with predictions from conventional
descriptors. In global searches for the most similar structures to
a selected subset, the overall performance of TD is significantly
better than that of the aggregate of the conventional descriptors
(CD); root-mean-square errors of performance properties between the
initial subset and the most similar set were estimated to be 6.60
v STP/v by TD and 11.34 v STP/v by CD for methane storage and 1.87 ×
10–4 kg CO2/kJ by TD and 2.75 ×
10–4 kg CO2/kJ by CD for carbon capture
applications. Furthermore, we showed that TD is highly capable of
detecting good materials in the entire set, as long as some top materials
are already known. Next, with confidence in the capability of TD to
predict performance properties without performing molecular simulations
for all structures and to match top-performing materials, we performed
high-throughput screening of zeolites for methane storage and carbon
capture applications. We showed that the TD screening approach is
highly efficient in detecting high-performing materials for both applications;
the promising set created by TD contained higher percentages of top-performing
structures than that obtained by CD.Although the TD has been
tested only for carbon capture and methane storage application in
only one kind of framework (zeolites), we expect that our methodology
can easily be extended to other applications by simply adjusting one
parameter, the size of the target gas molecule, and to other classes
of nanoporous materials (metal–organic frameworks, zeolitic
imidazolate frameworks, porous polymer networks, etc.) by taking into
account information about energy or charge.
Authors: Richard L Martin; Thomas F Willems; Li-Chiang Lin; Jihan Kim; Joseph A Swisher; Berend Smit; Maciej Haranczyk Journal: Chemphyschem Date: 2012-08-22 Impact factor: 3.102
Authors: D Ushizima; D Morozov; G H Weber; A G C Bianchi; J A Sethian; E W Bethel Journal: IEEE Trans Vis Comput Graph Date: 2012-12 Impact factor: 4.579
Authors: Li-Chiang Lin; Adam H Berger; Richard L Martin; Jihan Kim; Joseph A Swisher; Kuldeep Jariwala; Chris H Rycroft; Abhoyjit S Bhown; Michael W Deem; Maciej Haranczyk; Berend Smit Journal: Nat Mater Date: 2012-05-27 Impact factor: 43.841
Authors: P Y Lum; G Singh; A Lehman; T Ishkanov; M Vejdemo-Johansson; M Alagappan; J Carlsson; G Carlsson Journal: Sci Rep Date: 2013-02-07 Impact factor: 4.379
Authors: Christopher Rzepa; Daniel W Siderius; Harold W Hatch; Vincent K Shen; Srinivas Rangarajan; Jeetain Mittal Journal: J Phys Chem C Nanomater Interfaces Date: 2020 Impact factor: 4.126
Authors: Yasmine S Al-Hamdani; Péter R Nagy; Andrea Zen; Dennis Barton; Mihály Kállay; Jan Gerit Brandenburg; Alexandre Tkatchenko Journal: Nat Commun Date: 2021-06-24 Impact factor: 14.919