Thomas C Nicholas1, Eugeny V Alexandrov2,3,4, Vladislav A Blatov2,3, Alexander P Shevchenko2,4, Davide M Proserpio5,2, Andrew L Goodwin1, Volker L Deringer1. 1. Department of Chemistry, Inorganic Chemistry Laboratory, University of Oxford, Oxford OX1 3QR, U.K. 2. Samara Center for Theoretical Material Science (SCTMS) Samara State Technical University, Molodogvardeyskaya Street 244, Samara 443100, Russian Federation. 3. Samara University, Ac. Pavlov Street 1, Samara 443011, Russian Federation. 4. Samara Branch of P.N. Lebedev Physical Institute of the Russian Academy of Science, Novo-Sadovaya Street 221, Samara 443011, Russian Federation. 5. Dipartimento di Chimica, Università Degli Studi di Milano, Milano 20133, Italy.
Abstract
With ever-growing numbers of metal-organic framework (MOF) materials being reported, new computational approaches are required for a quantitative understanding of structure-property correlations in MOFs. Here, we show how structural coarse-graining and embedding ("unsupervised learning") schemes can together give new insights into the geometric diversity of MOF structures. Based on a curated data set of 1262 reported experimental structures, we automatically generate coarse-grained and rescaled representations which we couple to a kernel-based similarity metric and to widely used embedding schemes. This approach allows us to visualize the breadth of geometric diversity within individual topologies and to quantify the distributions of local and global similarities across the structural space of MOFs. The methodology is implemented in an openly available Python package and is expected to be useful in future high-throughput studies.
With ever-growing numbers of metal-organic framework (MOF) materials being reported, new computational approaches are required for a quantitative understanding of structure-property correlations in MOFs. Here, we show how structural coarse-graining and embedding ("unsupervised learning") schemes can together give new insights into the geometric diversity of MOF structures. Based on a curated data set of 1262 reported experimental structures, we automatically generate coarse-grained and rescaled representations which we couple to a kernel-based similarity metric and to widely used embedding schemes. This approach allows us to visualize the breadth of geometric diversity within individual topologies and to quantify the distributions of local and global similarities across the structural space of MOFs. The methodology is implemented in an openly available Python package and is expected to be useful in future high-throughput studies.
A cornucopia of experimentally
determined crystal structures is
described in continually expanding databases.[1,2] These
databases are now reaching sufficient sizes to provide an opportunity
for extracting structure–property relationships based on data
mining and machine learning (ML), in principle.[3−5] Establishing
these relationships, however, is a nontrivial task because there are
multiple ways by which crystal structures can be represented, compared,
and analyzed. This challenge is particularly acute for metal–organic
frameworks (MOFs) where the diversity of both the metal centers (“nodes”)
and the organic linkers gives rise to considerable structural complexity.[6−9]One of the ways in which MOFs are commonly described is in
terms
of their topology; that is, the connectivity between nodes and linkers.
Topological analysis routines are well-established and are implemented
in automated computer packages, such as ToposPro(10) and Systre,[11] and have been found to be useful predictors for a number
of material properties. For example, by considering the deformability afforded by a given topology (i.e., the ability of a framework to
distort geometrically without disrupting the net connectivity), one
may predict the rigidness of a framework,[12] the tendency to interpenetrate,[13] and elastic properties.[14,15] Ultimately,
by exploiting knowledge of how a given choice of the node and linker
will give rise to a particular topology, one can hope to design new
MOFs.[16−18]However, two MOFs with the same topology may have
very different network geometries. By the latter
term, we mean the spatial arrangement of local atomic environments
in a MOF, including variations in bond lengths, angles, and longer-range
ordering—all of which may affect material properties. For example,
positive and negative thermal expansion can be switched in MOFs of
a given topology simply by varying their geometry.[19] By their very nature, however, each of the established
geometric descriptors will cover only individual aspects of the structure;
two MOFs with similar metal–linker distances might have different
porosities, for example.Atom-density-based representations
offer an alternative means of
quantifying the geometric similarity of crystal structures.[20−22] One class of such metrics originated in the field of ML for physics
and chemistry applications, where the development of structural descriptors
for the atomistic structure is one of the central research tasks.[23−27] In particular, the smooth overlap of atomic positions (SOAP) descriptor
was developed initially in the context of fitting machine-learned
interatomic potentials.[28] Subsequently,
it was shown how similarity kernels based on the SOAP formalism may
be used to analyze the structural similarity for molecular and bulk
periodic structures.[29]By coupling
to ML techniques such as dimensionality reduction and
data clustering, one can begin to navigate complex configuration spaces[30−32] and search for underlying structure–property relationships.[33−38] We have recently demonstrated that a combination of coarse-graining,
rescaling, and SOAP analysis enables the geometric comparison between
very different classes of materials, exemplified using a database
of the AB2 hybrid and inorganic networks.[39] Other studies have emphasized the usefulness of unsupervised[31] and supervised[5] ML
for MOFs, and very recently, local coordination environments were
used as features for predicting oxidation states in these materials.[40]Here, with a view to facilitate further
quantitative studies of
the geometric structure and structure–property relationships
in MOFs, we describe a generalized coarse-graining approach for such
purposes and its implementation in an openly available Python package,
which we call CHIC (“Coarse graining of Hybrid
and Inorganic Crystals”)—expanding widely on the initial
work in ref (39). Using
a curated test set of four-connected AB2 coordination networks,
we validate structure outputs of our implementation using the well-established ToposPro topology analysis and then discuss examples of
structural and chemical analysis that are enabled by our approach.
Methodology
The computational methodology can be separated into two stages:
structure processing and structure analysis (Figure ). Each stage will be discussed in turn in
the following subsections. References to functions contained within
the code will be highlighted in typewriter typeface throughout. We
summarize the main aspects here; a separate tutorial and a practical
demonstration of using the code can be found in an interactive Jupyter
notebook (as detailed in the Data and Code Availability section).
Figure 1
Methodology for processing and analyzing MOF structure data sets
using a cg-SOAP-based approach (cf. ref (39)). Left: structures from the
experimental literature may first need to be “cleaned-up”
for analysis. Here, as in ref (39), ZIF-8 is used as an example: the H-disorder is resolved
using average positions of half-occupancy sites, and guest atoms in
the pores are removed. The structure is then reduced to its constituent
building blocks, viz., 12 Zn2+ A sites (purple) and 24 methyl imidazolate B sites; the latter sites are coarse-grained
using connectivity graphs (here, identifying the five-membered imidazolate
rings as the B-site centers, yellow). Right: the SOAP kernel is used to determine the similarity between all
coarse-grained and rescaled structures in the database, and the result
is visualized using dimensionality reduction. The resultant 2-D structure
map (sketched here in a purely schematic way) may then be interpreted
by correlating the locations of data points with structural and physical
properties.
Methodology for processing and analyzing MOF structure data sets
using a cg-SOAP-based approach (cf. ref (39)). Left: structures from the
experimental literature may first need to be “cleaned-up”
for analysis. Here, as in ref (39), ZIF-8 is used as an example: the H-disorder is resolved
using average positions of half-occupancy sites, and guest atoms in
the pores are removed. The structure is then reduced to its constituent
building blocks, viz., 12 Zn2+ A sites (purple) and 24 methyl imidazolate B sites; the latter sites are coarse-grained
using connectivity graphs (here, identifying the five-membered imidazolate
rings as the B-site centers, yellow). Right: the SOAP kernel is used to determine the similarity between all
coarse-grained and rescaled structures in the database, and the result
is visualized using dimensionality reduction. The resultant 2-D structure
map (sketched here in a purely schematic way) may then be interpreted
by correlating the locations of data points with structural and physical
properties.
Structure Processing
The main structure-processing
routines of our implementation are handled by a Python class called Structure. It uses Pymatgen(41) to parse and store information in the crystallographic
information file (CIF) format. All subsequent structure-processing
tasks are then callable as attributes of the Structure class as follows.
sort_sites()
Initially, the atomic species are classified
into two categories:
A (metal) sites and “other” sites. In the absence of
user-specified categories, elements are sorted according to the IUPAC
International Chemical Identifier classification:[42] metals are assigned to A sites and nonmetals are assigned
to “other”. In the present work, we focus on the geometric
structure of the underlying nets (as opposed to also considering the
energetic stability provided by guest ions and molecules), and therefore,
we remove nonframework species. Alkali and alkaline-earth metals (excluding
Li and Be) are often nonframework atoms and are therefore removed
during the species sorting algorithm (and ultimately from the final
coarse-grained structure). B, Si, and P are also sorted on a case-by-case
basis because there are many instances in which they behave as either
A or “other” sites (e.g., boron is an A site in boron
imidazolate frameworks[43] but a nonframework-site
atom when a part of a BF4– anion). Therefore, if the element is bound to O or
N—indicative of Lewis-acidic-type behavior—it is assigned
as an A site. It should be noted that such modifications to the sorting
algorithm are made based on the chemical nature of the data set being
studied and serve primarily to minimize the requirement for user input
during the coarse-graining procedure. The modifications may therefore
be amended or removed in future work, without affecting the conclusions
herein.
repair_disorder()
Site disorder occurs frequently in MOFs, and it is commonly modeled
crystallographically using split sites with partial occupancies.[44] To enable structure-processing tasks that require
discrete atomic positions, partially occupied (or disordered) sites
are simplified using three algorithms, callable with the Structure.repair_disorder() method. First, atoms of the
same species found within a given distance range (0–0.7 Å
by default) are replaced by an atom of the same type, situated on
the average of their positions. The remaining two routines deal with
delocalized electron density associated with guest species within
framework pores, which is often modeled by clusters of (fictitious)
oxygen atoms. These clusters are removed by identifying either oxygen
atoms that only have oxygen nearest neighbors or
oxygen atoms that have no neighbors at all within a set cutoff radius.
Not all disorder needs to be resolved for the general coarse-graining
procedure to work; the minimum requirement is that individual molecules
(e.g., organic ligands) do not overlap. However, unresolved disorder
may affect the placement of the discrete place-holder atoms (e.g.,
by skewing the centroid of a given molecule).
reduce()
Each unique
building block fragment, or ligand, is then identified
by a “nearest-neighbor crawl” algorithm by calling the Structure.reduce() method. The default neighboring-site
routine used is CrystalNN,[45] as implemented in Pymatgen.[41] In short, CrystalNN uses Voronoi decomposition
to assign weights to an initial list of neighbors within a hard cutoff
radius and then normalizes and assigns probabilities to each unique
weighting according to a smooth cutoff function. With this definition,
a building block fragment list is initialized with a given atom from
a given building block type, and if another atom is both a nearest-neighbor
and of the same building block type, it is added to the list. These
neighbors are then searched for their respective nearest-neighbors,
and the process is iterated (thereby “crawling” round
the ligand) until the list size converges. Once converged, the atomic
species, positions, and connectivity (in the form of a connectivity
graph) of the fragment are stored as a buildingUnit class instance and appended to the Structure.units class attribute. The algorithm repeats until all atoms have been
classified.
coarse_grain()
To coarse-grain the building blocks, a discrete
bonding center
must first be defined. There is seldom a unique choice for this. The
simplest definition is to take the geometric centroid of all of the
atomic positions in the building unit (i.e., not weighted by the atomic
masses). This parallels the methodology used in both crystal net determination
(often referred to as “equilibrium” or “barycentric”
placement)[11] and coarse-grained molecular
dynamic simulations[46] and is the definition
used in the present work. The buildingUnit class
stores the connectivity of a given building unit as a graph object
using the NetworkX package,[47] thereby enabling alternative bonding centers to be defined. These
alternative definitions are not used in the present work but are explored
in a Jupyter notebook that is provided with the code (see the Data and Code Availability section).The building-unit
connectivity is defined as the number of nearest-neighbor atoms of
a different site type which are connected to a given fragment. All
fragments with connectivity >1 (thereby distinguishing nonframework
species, such as solvents, from the B sites) are coarse-grained by
placing a dummy atom at the chosen bonding center and removing all
other atoms. Finally, all atoms of a given building-block type are
assigned to the same atomic species. The processed structures can
then be output in the CIF format.
Structure Analysis
Studying large, complex data sets
requires generalized analysis routines. At the core of our analysis
class, StructureMap, structures are compared
using SOAP[28] which we apply to coarse-grained
structural models (indicated by “cg-SOAP”).[39] Using dimensionality reduction algorithms, configurations
can be visualized and the relationships between them then analyzed.
cg-SOAP
The SOAP kernel measures the similarity of
pairs of atomic environments.[28,29,32] Formally, for each atom, α, an atomic density, ρα(r), is constructed with a sum of Gaussians
of broadness σ, centered on each neighbor, β, of α
(and on α itself):where fcut denotes a cutoff function.
The SOAP kernel is then defined as the overlap integral of any two
neighbor densities, integrated over all three-dimensional rotations R̂,[28]where the exponent is typically
set to n > 1 to retain angular information.[29] In practice, it is computationally more efficient
to expand the atomic density in a set of orthogonal radial basis functions
and spherical harmonics up to a given nmax and lmax. The resulting combination
coefficients on their own do not yet ensure rotational invariance
(because all spherical harmonics with l > 0 depend
on the angular orientation) and are therefore collected into a power
spectrum vector. The SOAP kernel may then be calculated by taking
the normalized dot product of the two power spectrum vectors associated
with each atomic environment, raised to an exponent ζ which
serves to accentuate the distinction between the two environments.[28]As shown in ref (48) for elemental structures
and in ref (39) for
a range of inorganic and hybrid materials,
geometric similarity may be assessed using SOAP for uniformly rescaled
structures, enabling direct comparison irrespective of characteristic
A–B distances. We have implemented two scaling approaches:
either scaling to a uniform minimumr(A–B) distance or scaling to a uniform averager(A–B) distance.To extend the similarity
measure beyond comparing individual atomic
environments, similarities between pairs of atoms in each of the crystal
structures are calculated:where α (β) runs over all atomic
sites, , in the unit
cell of structure i (j), respectively.
Variations of this
method are also implemented where the atomic sites considered are
restricted to a given site type (e.g., A sites), thereby shifting
the focus of the similarity analysis toward those particular sites.
(In this case, information regarding the other site types is still
implicitly encoded through the neighbor densities.)
Dimensionality
Reduction and Visualization
To interpret
(cg-)SOAP analysis results, the data set is often visualized as a
two-dimensional projection.[29,32,39] A large number of algorithms are available to carry out this projection
(or “embedding”), and a central aspect of the present
work will be to compare different widely used embedding schemes. Our
implementation stores the similarity of all structures with one another
in the form of a symmetric similarity matrix, K, which
we construct using the per-cell averaged similarity (eq ), viz., . The corresponding geometric distance matrix, D, may
also be defined with elementsto satisfy the triangle inequality.[29] We
currently provide interfaces to the following
dimensionality reduction algorithms implemented in external code packages:
multidimensional scaling (MDS),[49] t-distributed
stochastic neighbor embedding (t-SNE),[50] and the uniform manifold approximation and projection (UMAP).[51]
Bonding and Properties
During the
reduction of MOF
structures to their coarse-grained representations, a dictionary of
bonds between building units is stored, as recently defined in the
IUCr topology dictionary (topoCIF). This enables the CIF output to
contain the requisite information to readily construct the underlying
net,[52,53] that is, the net of building units, and
to calculate its topological descriptors (e.g., using ToposPro). It also enables the calculation of local-environment properties,
including bond lengths, angles, and order parameters. We use a module,
referred to as bonding, to extract this geometric
information and calculate Chau–Hardwick tetrahedral order parameters
and Steinhardt bond order parameters.[54−56] This module might also
be extended for other custom analyses.Two global structure properties are also included in the routine structure
analysis. The first is A-site density, an important material descriptor
when considering the potential void space present in a framework.
The second property is the A-site SOAP heterogeneity introduced in
previous work,[39] which measures the diversity
of the A-site environments in a given structure. A value of zero means
that all A-site environments are geometrically equivalent (up to the
SOAP cutoff radius); a higher value indicates greater diversity. It
is calculated aswhere P is the set of (unordered)
pairs of distinct A-site environments in the structure, is
the number of A-site atoms in the structure,
and k is the SOAP similarity kernel defined in eq .
Database Details
The data set curated for this work
expands upon the AB2 study reported in ref (39), now focusing on analyzing
the diversity in the wider set of AB2 MOFs with two-connected
ligands. Restricting the study to a single coordination formula and
ligand connectivity enables careful validation of the reported coarse-graining
methods and a thorough examination of the results; in particular,
being able to relate trends in our configuration space to local geometric
properties of individual structures serves as a useful tool for understanding
what information is captured in the cg-SOAP approach.Primary
data were selected from the sample that was prepared for ref (13), filtering for all structures
in which the coordination formula was AB2 and the ligand
(B) was two-connected. The sample contained 1160 crystal structures,
to which 102 structures from the CSD 5.42 update 1 (Feb 2021) were
added. A complete list of the 1262 crystal structures and their topological
descriptors is provided in Supporting Information (see the file CF_A_B2_1262.xlsx). The “experimental literature”
to be processed was exported as CIFs from the CSD using the 1262 entry
refcodes and processed unchanged.
Results and Discussion
Visualizing
Geometric Diversity
We begin by visualizing
the geometric diversity in the AB2 MOF data set, here embedding
cg-SOAP similarities for all (uniformly scaled) structures in two
dimensions. Dimensionality reduction is an unsupervised ML task, aiming
to extract information from unlabeled data sets,
for example, by clustering the data into groups. We start by visualizing
the data set using one of the simplest algorithms, MDS (as used in
ref (39)), in Figure . The structure map
carries the intuitive interpretation that structures that are similar
appear close together and structures that are dissimilar are further
apart. In the context of MOFs, structures are generally reported with
their topological identifier which uniquely identifies the underlying
net. We highlight the distribution of the three most commonly occurring
topologies in the data set—the diamond-like dia net (189 structures), the sodalite-like sod net (136
structures), and the fourfold interpenetrated 4#dia net
(116 structures)—to investigate the degree to which topological
variation is represented in the cg-SOAP representation. Strikingly,
all three topological families have structures dispersed over the
map, with the dia topology being most widely distributed
among the three (light blue). There are regions of overlapping points,
particularly for dia and sod in the lower
part of Figure , indicating
that topologically dissimilar MOFs might in fact show similarities
in terms of their geometric structure.
Figure 2
Geometric diversity within
isoreticular groups of AB2 MOFs. The graph shows a two-dimensional
visualization of the cg-SOAP-based
structural distances using an MDS embedding: generally, the closer
the two points are, the more similar their coarse-grained and rescaled
geometric structures are. The distribution of the three most commonly
occurring topologies within our data set (dia, sod, and fourfold-interpenetrated dia, denoted
as 4#dia) is emphasized by lines that originate from
the respective centroid. Data set entries with different topologies
than the three aforementioned ones are all represented by gray points.
Geometric diversity within
isoreticular groups of AB2 MOFs. The graph shows a two-dimensional
visualization of the cg-SOAP-based
structural distances using an MDS embedding: generally, the closer
the two points are, the more similar their coarse-grained and rescaled
geometric structures are. The distribution of the three most commonly
occurring topologies within our data set (dia, sod, and fourfold-interpenetrated dia, denoted
as 4#dia) is emphasized by lines that originate from
the respective centroid. Data set entries with different topologies
than the three aforementioned ones are all represented by gray points.We quantify the relative distributions of each
topology in the
map based on how far, on average, the structures of that group are
from their respective centroid. These relative distributions are 0.771,
0.386, and 0.441 for dia, sod, and 4#dia, respectively. For ease of comparison, we normalize
the relative distributions such that the value for dia is unity (Table ).
Table 1
Characteristics of the Ten Most Commonly
Occurring Topologies in the Data Set
geometric densitya
relative
distribution
topology
occurrences
(mean ± SD)
MDS
t-SNE
UMAP
dia
189
125 ± 56
1 (reference)
sod
136
80 ± 26
0.50
0.45
0.60
4#dia
116
261 ± 77
0.57
0.47
0.44
5#dia
95
271 ± 87
0.67
0.47
0.50
3#dia
88
206 ± 76
0.71
0.48
0.45
2#dia
79
182 ± 82
0.91
0.76
0.55
cds
41
221 ± 87
0.81
0.54
0.44
qzd
38
156 ± 15
0.12
0.18
0.26
6#dia
36
336 ± 137
0.74
0.55
0.47
3#dmp
33
292 ± 49
0.38
0.23
0.31
The geometric density is here defined
as n(A) × 1000/Vscaled, where n(A) is the number of A sites in the unit
cell and Vscaled is the volume of the
scaled unit cell.
The geometric density is here defined
as n(A) × 1000/Vscaled, where n(A) is the number of A sites in the unit
cell and Vscaled is the volume of the
scaled unit cell.We see
that dia networks can be formed with a much
larger variety of geometries compared to sod; this is
in accordance with the conclusions derived from an analysis of the
natural tiles (cages).[13] This reflects
a larger deformability of diamondoid structures and adaptability to
building blocks with geometries spanning a wide range of volumes,
lengths, and angles. This feature promotes the dominance of the dia topology in coordination networks.[13] The folding of the dia networks into interpenetrating
arrays, however, significantly restricts the diversity of acceptable
network geometries.Cluster analysis can assist the interpretation
of complex data
sets by grouping data and identifying representative examples, from
which patterns can more readily be identified. Affinity propagation
is a clustering algorithm that views each data point as a node in
a network and recursively minimizes the edge weights between nodes;
the magnitude of each edge at a given time reflects the current affinity
the point has for selecting the second point as its “examplar”.[57]Figure a shows the dia structures (light blue data points
in Figure ) now divided
into four clusters, with the exemplar coarse-grained structures visualized.
From these clusters, we investigate the distribution of structural
properties in different regions of the map in order to appreciate
the geometric diversity available within the dia topology.
In the context of MOFs, low metal densities are the simplest indication
for the presence of void space: an important feature for catalytic
applications. We define the geometric density as the (unitless) density
of the structures that have been coarse-grained and scaled (to unity
minimum r(A–B) bond length). The geometric density is related to the experimental metal density
by the characteristic framework bond length. We plot the distribution
of the geometric density, average angular component of the Chau–Hardwick
order parameter, Sg (a value of zero corresponds
to ideal tetrahedral bond angles; a value of unity would correspond
to the extreme case where all four bonds are superimposed), and the
average A−B−A angle in Figure b–d, respectively.
Figure 3
Geometric diversity within
MOFs of dia topology. (a)
From the data set characterized in Figure , we isolate the dia entries
and analyze their distribution using a clustering algorithm, viz.,
affinity propagation.[57] For each cluster,
the algorithm selects an “examplar” datapoint, and the
corresponding coarse-grained structures and their CSD refcodes are
shown. Distributions of local properties for each cluster are presented
on the right-hand side: (b) the geometric density, (c) the average
angular component of the Chau–Hardwick order parameter, Sg, and (d) the average A−B−A angle.
Throughout this paper, box plots are drawn such that boxes range from
the 25th to the 75th percentile, with the median indicated by a horizontal
line; whiskers span ±1.5 times the interquartile range, and points
outside this range are plotted with circle markers.
Geometric diversity within
MOFs of dia topology. (a)
From the data set characterized in Figure , we isolate the dia entries
and analyze their distribution using a clustering algorithm, viz.,
affinity propagation.[57] For each cluster,
the algorithm selects an “examplar” datapoint, and the
corresponding coarse-grained structures and their CSD refcodes are
shown. Distributions of local properties for each cluster are presented
on the right-hand side: (b) the geometric density, (c) the average
angular component of the Chau–Hardwick order parameter, Sg, and (d) the average A−B−A angle.
Throughout this paper, box plots are drawn such that boxes range from
the 25th to the 75th percentile, with the median indicated by a horizontal
line; whiskers span ±1.5 times the interquartile range, and points
outside this range are plotted with circle markers.Figure b
shows
the distribution of geometric density for each cluster. There is a
relatively clear separation between each group, with cluster 2 (orange) containing notably more dense frameworks. It is
interesting to note that clusters 2 and 4, which have higher average geometric density, also seem to have
a broader distribution across the MDS map, relative to the narrow
distributions of clusters 1 and 3. This
could, in part, be explained by the relatively high deviation away
from ideal tetrahedral bond angles about the A sites, as evident from
the plot of the angular Chau-Hardwick tetrahedral order parameter, Sg (Figure c). Figure d shows the distribution of the average A−B−A
bond angle; again, clusters 2 and 4 demonstrate
broad distributions with lower average values.We extend the
analysis of the dia subset of structures
using parameters extracted using ToposPro;[53] in particular, the tile average distortion[13] supports the results illustrated in Figure . Clusters with higher
average geometric density have larger distortions in the tetrahedral
coordination of the A sites, corresponding to a collapse of the tiles.
Conversely, the most porous structures have A site coordination environments
closer to an ideal tetrahedron and the largest proportion of tiles
close to the adamantane tile of the ideal dia net. Analogously,
structures with A−B−A angles less than 150° also
correspond to denser structures due to the collapse of tiles.Figure emphasizes
that a single topological classification may give rise to a geometrically
diverse set of structures; that is to say, one cannot necessarily
predict the geometric features of a structure from its topological
label alone. From Figure , we therefore infer that the study of the latent geometric configuration space may provide insights into
a different set of material properties (e.g., bulk modulus) compared
to those which are accounted for by the topology (e.g., porosity[13]).
Embedding Schemes
The universal
aim of dimensionality
reduction algorithms is to capture a meaningful structure in high-dimensional
data when embedded into low dimensions. However, it is imperative
to consider the algorithm methodology when interpreting the structure
map. To illustrate this point, we have visualized our data set with
three dimensionality reduction algorithms; namely, MDS,[49] t-SNE,[50] and UMAP[51] (Figure ). Our principal aim is to demonstrate that the interpretation
of our cg-SOAP representations is invariant to the specific embedding
scheme chosen; indeed, there are other algorithms available (e.g.,
kernel principal component analysis[58] and
variants thereof[59]) that are not considered
in this study but could be used in future work for alternative and/or
complementary interpretations of the configuration space. In order
to understand how each representation differs from one another, we
color-code the map by the geometric density (Figure a) and by the distributions of dia, sod, and 4#dia topologies (Figure b, cf. Figure ), and compare side-by-side how these properties
are represented in the results of different embedding schemes.
Figure 4
Visualizing
geometric diversity in MOFs using different embedding
schemes. The plots compare the results of multidimensional scaling
(MDS,[49]left), t-distributed stochastic neighbor embedding (t-SNE,[50]center), and uniform manifold
approximation and projection (UMAP,[51]right). For each embedding scheme, we show two-dimensional
structure maps characterizing the coarse-grained and scaled AB2 MOF data set, colored (a) by geometric density,
and (b) with the distribution of the three most commonly occurring
topologies in the data set, dia, sod, and 4#dia, highlighting the respective centroids as in Figure .
Visualizing
geometric diversity in MOFs using different embedding
schemes. The plots compare the results of multidimensional scaling
(MDS,[49]left), t-distributed stochastic neighbor embedding (t-SNE,[50]center), and uniform manifold
approximation and projection (UMAP,[51]right). For each embedding scheme, we show two-dimensional
structure maps characterizing the coarse-grained and scaled AB2 MOF data set, colored (a) by geometric density,
and (b) with the distribution of the three most commonly occurring
topologies in the data set, dia, sod, and 4#dia, highlighting the respective centroids as in Figure .All three embeddings demonstrate a strong correlation of
the distribution
of points with geometric density and qualitatively similar trends
in the relative distributions of topologies within each structure
map (Table ). The
detail, however, varies. MDS leads to a distinct region in the center
of the map where no structures are found. This perhaps suggests that
this algorithm, which seeks to minimize a relatively simple loss function,
is reaching its limits of usefulness for the size and complexity of
the data set considered here.The “island-like”
features in both the t-SNE and
UMAP representations—or more specifically, the absence of islands
in the MDS representation—further emphasize this point. The
most prominent island (right-hand side of the t-SNE map and upper-right
of the UMAP map) is a family of qzd MOFs, with the smallest
relative distribution, among the 10 most commonly occurring topologies
in the data set, for MDS and t-SNE (Table ). Manual inspection of the structures reveals
a high frequency of similar CCDC reference codes (refcodes). The refcode
system is designed to group together structures into families, such
as the same compound having been crystallized and characterized under
different conditions, or polymorphs of the same compound. There are
two families of refcodes within the qzd MOF cluster:
LIWDEB (13 occurrences) and UKUVOL (21 occurrences).[60,61] All of these structures are one of two isomers of [{Cu(succinate)
(4,4′-bipyridine)}], isolated
under different experimental conditions, and thus could be considered
duplicates. The high geometric similarity between duplicates relative
to the similarity between other structures in the data set skews the
representation toward creating an isolated cluster.
Islands and
Duplicates
The occurrence of “islands”
of structures, separated out near the edge of the t-SNE and UMAP representations,
requires a more subtle interpretation. For example, one of the islands
to the left of the UMAP map corresponds to structures with zni and coi topologies (α and β polymorphs
of Zn(Im)2, respectively), predominantly classified within
the IMIDZB refcode family. “Duplicates” cannot necessarily
be identified as those with a common refcode; however, cg-SOAP screening
can help automate this procedure. To illustrate this point, we isolate
the IMIDZB family of structures from the database, create a structure
map using t-SNE (because this algorithm generally achieves clearer
clustering of data into distinct regions, albeit at the cost of meaningful
intercluster distances), and analyze the representation using affinity
propagation, as shown in Figure .
Figure 5
Using cg-SOAP analysis to identify relationships between
individual
CSD entries with a single refcode. The example case here is given
by the polymorphs of Zn(Im)2, with CSD refcodes IMIDZB,
suffixed with numbers representing a running index. The geometric
diversity within the IMIDZB refcode family is visualized using t-SNE
embedding and a cluster analysis by affinity propagation. The structures
are labeled according to topology. The zni polymorphs
cluster together; it could therefore be beneficial to choose a representative structure from this cluster. The spread of
the cag topology into two distinct clusters illustrates
the complexity of “duplicate” detection, as discussed
in the text.
Using cg-SOAP analysis to identify relationships between
individual
CSD entries with a single refcode. The example case here is given
by the polymorphs of Zn(Im)2, with CSD refcodes IMIDZB,
suffixed with numbers representing a running index. The geometric
diversity within the IMIDZB refcode family is visualized using t-SNE
embedding and a cluster analysis by affinity propagation. The structures
are labeled according to topology. The zni polymorphs
cluster together; it could therefore be beneficial to choose a representative structure from this cluster. The spread of
the cag topology into two distinct clusters illustrates
the complexity of “duplicate” detection, as discussed
in the text.The IMIDZB refcode family contains
different polymorphs of Zn(Im)2 with different characteristic
geometries and topologies,
which therefore separate in the cg-SOAP map. By analyzing the geometric
diversity in this smaller configuration space, we propose an automated
“duplicate” structure identification procedure. The zni topology is the densest, most stable crystalline polymorph
of Zn(Im)2, and all structures of this connectivity are
found in the same cluster, labeled 1. One structure (e.g.,
the cluster examplar selected by the affinity propagation algorithm)
could be taken as representative of this particular polymorph. The
distribution of cag frameworks across two clusters (2 and 3), however, is an example where frameworks
with identical composition, connectivity, and space group display
diverse geometries, whereas IMIDZB11 corresponds to the desolvated
ZIF-4 framework under ambient conditions (298 K, 1 atm) and IMIDZB12
and IMIDZB15 are the same frameworks after decreasing temperature
(80 K) and increasing pressure (0.15 GPa), respectively, exemplifying the “breathing” effect
in the frameworks.[62,63] With the change in external stimuli,
the frameworks become more dense: the average A−B−A
angle decreases and Sg increases, corresponding
to a lowered “tetrahedrality” around the Zn (A) sites.
On these grounds, it may be desirable to keep one structure from each
cluster, that is, IMIDZB10 (cluster 2) and IMIDZB12 or IMIDZB15 (cluster 3), in order to capture
the geometric diversity fully.More generally, one could propose
an algorithm that classifies
duplicates by considering the refcode, topology, and cg-SOAP similarity
as an automated approach that makes it possible to preserve the subtle
geometric diversity that arises from varying experimental conditions.
The screening of duplicates is expected to be helpful (and indeed
required) for moving to very large databases in the future, as demonstrated
in a recent analysis of DFT-optimized data sets of MOFs.[64]
Quantifying Geometric Diversity
Our approach also enables
quantitative investigation of local structural properties and how
they are distributed for different categories of structures (e.g.,
topology). For example, interpenetration is commonly found in MOFs,
and it holds implications for the potential functionality of a given
compound because it is closely related to porosity.[65] Generally, porous materials minimize the energy of the
framework through optimal filling of void space, and thus, in cases
where void space is of sufficient size, interpenetration may be observed.
Controlling the degree of interpenetration has been explored using
subtle changes in the synthetic methodology, such as varying reaction
conditions, templating agents, and ligand design.[66,67] Given the increasing number of MOF crystal structures reported,
the question arises as to whether we can postrationalize the extent
to which the local geometry influences the tendency to interpenetrate.In Figure , we
show the distributions of local properties for each structure, for
different degrees of interpenetration of the diamond-like net (Z = 1 corresponds to dia MOFs, Z = 2 corresponds to 2#dia, and so on). In Figure a, we show the distribution
of A-site heterogeneity values (eq ). Figure b,c illustrates the distributions of established local property
descriptors; namely, Sg and the average
A−B−A angle, respectively.
Figure 6
Distributions of geometric
property indicators for increasing degrees
of interpenetration, Z, of the dia-based
MOFs in the data set. The figure shows (a) A-site heterogeneity (eq ), (b) average angular
component, Sg, of the Chau–Hardwick
order parameter, and (c) average A−B−A angle.
Distributions of geometric
property indicators for increasing degrees
of interpenetration, Z, of the dia-based
MOFs in the data set. The figure shows (a) A-site heterogeneity (eq ), (b) average angular
component, Sg, of the Chau–Hardwick
order parameter, and (c) average A−B−A angle.We note that the majority of dia MOFs
have locally
homogeneous A sites, and this homogeneity does not appear to substantially
depend on the degree of interpenetration (Figure a). Similarly, the distribution of the average Sg does not show a clear correlation with the
degree of interpenetration (Figure b). There is, in contrast, a much stronger correlation
with the A−B−A angle: as the degree of interpenetration
increases, the average A−B−A angle tends toward 180°
(Figure c). This makes
intuitive sense when one considers that longer, “rod-like”
ligands gives rise to greater void space and therefore enable a greater
degree of interpenetration, which is consistent with the synthetic
approach of employing longer spacer ligands to target higher degrees
of interpenetration.[68−70] It may be inferred from these distributions that
the A sites maintain a similar environment, irrespective of the degree
of interpenetration, whereas the linker geometry plays a crucial role
in determining this property.Finally, we extend this quantitative
analysis to all topologies
that occur at least five times in the data set, plotting the distribution
of A-site heterogeneity values in the respective structures in Figure .
Figure 7
Distributions of A-site
heterogeneity (eq )
for all structures in the data set for which
the topology occurs ≥5 times. Three example MOFs are highlighted
with coarse-grained structural representations, CSD refcodes, and
the corresponding topology symbol and are discussed in the text.
Distributions of A-site
heterogeneity (eq )
for all structures in the data set for which
the topology occurs ≥5 times. Three example MOFs are highlighted
with coarse-grained structural representations, CSD refcodes, and
the corresponding topology symbol and are discussed in the text.The qzd structures have a distinctly
narrow distribution
of A-site heterogeneity values, which reinforces the hypothesis that
the separation from the main body of structures in the t-SNE and UMAP
embeddings (Figure ) is a skewing of the visualization as a result of duplicate structures.
It is also interesting to note that the zni topology
has a narrow A-site heterogeneity distribution at a relatively high
average value, which likely contributes to the separation of the zni structures.Figure highlights
examples of topologies with particularly low (qzd) and
particularly high (4T13, pts, and 2#pts) A-site heterogeneity, for which illustrative coarse-grained
and scaled crystal structures are visualized. The structures with pts and 2#pts topology (which are related by
increasing from a single to twofold interpenetrated net) all contain
both tetrahedral and square planar geometries about different A sites,
typically by combining Zn (tetrahedral geometric preference) with
any of Ni, Cu, Pt, or Pd (square-planar geometric preference). Some
frameworks have Cu in both the tetrahedral and square-planar sites
of the pts framework. When combined with Au, Cu/Ag occupy
the tetrahedral sites in the 2#pts framework. One might
consider attempting to target these pts topologies, therefore,
by selecting metals with the appropriate geometric preferences demanded
by the framework.The high degree of heterogeneity in the 4T13 frameworks
can be ascribed to a tension between the differing lengths of two
organic linkers, on the one hand, and the additional flexibility afforded
using two linkers, on the other hand (cf. DEBWAK in Figure ). The 4T13 frameworks
have a short linker (e.g., isophthalate in DEBWAK) and a long-chain,
flexible organic linker (e.g., N,N′-bis(pyridin-4-yl)-2,2′-bipyridine-5,5′-dicarboxamide
in DEBWAK).[71,72] The resultant framework has 1-D
helical chains that cross each other to create a 2-D molecular braid
with geometrically distinct Zn sites. Automated, quantitative analyses
such as those exemplified in Figures and 7 should be a helpful part
of the methodology used in future work for understanding the geometric
and structural diversity in databases of materials.
Conclusions
We have studied the structures of AB2 MOFs containing
a diverse set of two-connected organic linkers. By coupling a cg-SOAP
approach to different embedding schemes, we have visualized and analyzed
the geometric diversity in a database of MOF structures. With the
aid of cluster analysis, the structure maps of the AB2 MOF
configuration space can be better understood. Here, we have focused
on clustering within the low-dimensional embedding and demonstrated
how the location of structures in the map was consistent with the
grouping of structures with similar structural properties. We described
the cg-SOAP and visualization methodology implemented in a Python
package and validated the routines by confirming that the underlying
net was correct, using ToposPro software. We anticipate
that the methodology described in this work will be useful in visualizing,
analyzing, and understanding the geometric diversity in larger MOF
data sets, to which we will dedicate future work.
Computational
Details
Our code imports functionality from Pymatgen (the
version used for the present work was 2020.12.31),[41] the atomic simulation environment (ASE)
(version 3.19.0),[73] and NetworkX (version 2.3)[47] for structure processing
tasks. The SOAP implementation is imported from DScribe (version 0.4.0).[74] The Scikit-learn (version
0.21.3)[75] implementation of MDS and t-SNE
and the stand-alone UMAP implementation (version 0.4.2) are used for
dimensionality reduction. The Scikit-learn implementation of affinity
propagation was used for cluster analysis. References for the algorithms
themselves are given in the main text.All structures were uniformly
scaled to a minimum r(A−B) bond distance of
unity. We computed SOAP vectors using
the polynomial basis functions implemented in DScribe, with a radial cutoff of rcut = 2.5,
smoothness of σ = 0.2, and atomic neighbor density expansion
of up to nmax = 10, lmax = 9 (as in ref (39)). (Note that here we do not include units for rcut and σ because we have rescaled all structures.)In order to assist the comparison of embedding schemes in Figure , the MDS coordinates
were reflected in the x-axis and the UMAP coordinates
were rotated 90° clockwise.In terms of technical comparisons
of the different algorithms,
it is worth mentioning the relative times taken for the code to execute.
All three algorithms could be performed on a standard MacBook Pro
(1.4 GHz Quad-Core Intel Core i5 processor; 8 GB memory). The absolute
times for MDS, t-SNE, and UMAP calculations were 98, 348, and 22 s, respectively. Hence, UMAP outperforms the other
two embedding schemes for this particular purpose. It should be noted
that a particularly low learning rate (5) was chosen for t-SNE because
this was found to better capture the structure of the data (based
on visual inspection of the relative “tightness” of
clustering: smeared-out clusters can often be a sign that the algorithm
has ended before reaching convergence). For future work on larger
data sets, UMAP might therefore be preferred over t-SNE for its faster
execution time.The absolute positions of data points in the
structure maps will
depend slightly on the specific parameters chosen (and on numerical
issues), particularly for t-SNE and UMAP; however, the global trends
and interpretation of the visualizations were found to remain consistent
for different choices of embedding parameters.
Data and Code Availability
The CHIC Python package described in this work
is openly available online at https://github.com/tcnicholas/chic; the repository includes a tutorial (in the Jupyter notebook format)
for the processing and coarse-graining of an example structure. The
code is under ongoing development, and therefore, we have also deposited
a copy of the specific version used to generate the figures in the
present work at https://doi.org/10.5281/zenodo.5271082. The full data set of
coarse-grained structures is available from the same Zenodo link.
Authors: Ask Hjorth Larsen; Jens Jørgen Mortensen; Jakob Blomqvist; Ivano E Castelli; Rune Christensen; Marcin Dułak; Jesper Friis; Michael N Groves; Bjørk Hammer; Cory Hargus; Eric D Hermes; Paul C Jennings; Peter Bjerre Jensen; James Kermode; John R Kitchin; Esben Leonhard Kolsbjerg; Joseph Kubal; Kristen Kaasbjerg; Steen Lysgaard; Jón Bergmann Maronsson; Tristan Maxson; Thomas Olsen; Lars Pastewka; Andrew Peterson; Carsten Rostgaard; Jakob Schiøtz; Ole Schütt; Mikkel Strange; Kristian S Thygesen; Tejs Vegge; Lasse Vilhelmsen; Michael Walter; Zhenhua Zeng; Karsten W Jacobsen Journal: J Phys Condens Matter Date: 2017-03-21 Impact factor: 2.333
Authors: Benjamin A Helfrecht; Rocio Semino; Giovanni Pireddu; Scott M Auerbach; Michele Ceriotti Journal: J Chem Phys Date: 2019-10-21 Impact factor: 3.488
Authors: Felix Musil; Andrea Grisafi; Albert P Bartók; Christoph Ortner; Gábor Csányi; Michele Ceriotti Journal: Chem Rev Date: 2021-07-26 Impact factor: 60.622