Literature DB >> 35966284

Visualization and Quantification of Geometric Diversity in Metal-Organic Frameworks.

Thomas C Nicholas¹, Eugeny V Alexandrov^2,3,4, Vladislav A Blatov^2,3, Alexander P Shevchenko^2,4, Davide M Proserpio^5,2, Andrew L Goodwin¹, Volker L Deringer¹.

Abstract

With ever-growing numbers of metal-organic framework (MOF) materials being reported, new computational approaches are required for a quantitative understanding of structure-property correlations in MOFs. Here, we show how structural coarse-graining and embedding ("unsupervised learning") schemes can together give new insights into the geometric diversity of MOF structures. Based on a curated data set of 1262 reported experimental structures, we automatically generate coarse-grained and rescaled representations which we couple to a kernel-based similarity metric and to widely used embedding schemes. This approach allows us to visualize the breadth of geometric diversity within individual topologies and to quantify the distributions of local and global similarities across the structural space of MOFs. The methodology is implemented in an openly available Python package and is expected to be useful in future high-throughput studies.

Entities: Chemical

Year: 2021 PMID： 35966284 PMCID： PMC9367000 DOI： 10.1021/acs.chemmater.1c02439

Source DB: PubMed Journal: Chem Mater ISSN： 0897-4756 Impact factor: 10.508

Introduction

A cornucopia of experimentally determined crystal structures is described in continually expanding databases.[1,2] These databases are now reaching sufficient sizes to provide an opportunity for extracting structure–property relationships based on data mining and machine learning (ML), in principle.[3−5] Establishing these relationships, however, is a nontrivial task because there are multiple ways by which crystal structures can be represented, compared, and analyzed. This challenge is particularly acute for metal–organic frameworks (MOFs) where the diversity of both the metal centers (“nodes”) and the organic linkers gives rise to considerable structural complexity.[6−9] One of the ways in which MOFs are commonly described is in terms of their topology; that is, the connectivity between nodes and linkers. Topological analysis routines are well-established and are implemented in automated computer packages, such as ToposPro(10) and Systre,[11] and have been found to be useful predictors for a number of material properties. For example, by considering the deformability afforded by a given topology (i.e., the ability of a framework to distort geometrically without disrupting the net connectivity), one may predict the rigidness of a framework,[12] the tendency to interpenetrate,[13] and elastic properties.[14,15] Ultimately, by exploiting knowledge of how a given choice of the node and linker will give rise to a particular topology, one can hope to design new MOFs.[16−18] However, two MOFs with the same topology may have very different network geometries. By the latter term, we mean the spatial arrangement of local atomic environments in a MOF, including variations in bond lengths, angles, and longer-range ordering—all of which may affect material properties. For example, positive and negative thermal expansion can be switched in MOFs of a given topology simply by varying their geometry.[19] By their very nature, however, each of the established geometric descriptors will cover only individual aspects of the structure; two MOFs with similar metal–linker distances might have different porosities, for example. Atom-density-based representations offer an alternative means of quantifying the geometric similarity of crystal structures.[20−22] One class of such metrics originated in the field of ML for physics and chemistry applications, where the development of structural descriptors for the atomistic structure is one of the central research tasks.[23−27] In particular, the smooth overlap of atomic positions (SOAP) descriptor was developed initially in the context of fitting machine-learned interatomic potentials.[28] Subsequently, it was shown how similarity kernels based on the SOAP formalism may be used to analyze the structural similarity for molecular and bulk periodic structures.[29] By coupling to ML techniques such as dimensionality reduction and data clustering, one can begin to navigate complex configuration spaces[30−32] and search for underlying structure–property relationships.[33−38] We have recently demonstrated that a combination of coarse-graining, rescaling, and SOAP analysis enables the geometric comparison between very different classes of materials, exemplified using a database of the AB2 hybrid and inorganic networks.[39] Other studies have emphasized the usefulness of unsupervised[31] and supervised[5] ML for MOFs, and very recently, local coordination environments were used as features for predicting oxidation states in these materials.[40] Here, with a view to facilitate further quantitative studies of the geometric structure and structure–property relationships in MOFs, we describe a generalized coarse-graining approach for such purposes and its implementation in an openly available Python package, which we call CHIC (“Coarse graining of Hybrid and Inorganic Crystals”)—expanding widely on the initial work in ref (39). Using a curated test set of four-connected AB2 coordination networks, we validate structure outputs of our implementation using the well-established ToposPro topology analysis and then discuss examples of structural and chemical analysis that are enabled by our approach.

Methodology

The computational methodology can be separated into two stages: structure processing and structure analysis (Figure ). Each stage will be discussed in turn in the following subsections. References to functions contained within the code will be highlighted in typewriter typeface throughout. We summarize the main aspects here; a separate tutorial and a practical demonstration of using the code can be found in an interactive Jupyter notebook (as detailed in the Data and Code Availability section).

Figure 1

Methodology for processing and analyzing MOF structure data sets using a cg-SOAP-based approach (cf. ref (39)). Left: structures from the experimental literature may first need to be “cleaned-up” for analysis. Here, as in ref (39), ZIF-8 is used as an example: the H-disorder is resolved using average positions of half-occupancy sites, and guest atoms in the pores are removed. The structure is then reduced to its constituent building blocks, viz., 12 Zn2+ A sites (purple) and 24 methyl imidazolate B sites; the latter sites are coarse-grained using connectivity graphs (here, identifying the five-membered imidazolate rings as the B-site centers, yellow). Right: the SOAP kernel is used to determine the similarity between all coarse-grained and rescaled structures in the database, and the result is visualized using dimensionality reduction. The resultant 2-D structure map (sketched here in a purely schematic way) may then be interpreted by correlating the locations of data points with structural and physical properties.

Structure Processing

The main structure-processing routines of our implementation are handled by a Python class called Structure. It uses Pymatgen(41) to parse and store information in the crystallographic information file (CIF) format. All subsequent structure-processing tasks are then callable as attributes of the Structure class as follows.

sort_sites()

Initially, the atomic species are classified into two categories: A (metal) sites and “other” sites. In the absence of user-specified categories, elements are sorted according to the IUPAC International Chemical Identifier classification:[42] metals are assigned to A sites and nonmetals are assigned to “other”. In the present work, we focus on the geometric structure of the underlying nets (as opposed to also considering the energetic stability provided by guest ions and molecules), and therefore, we remove nonframework species. Alkali and alkaline-earth metals (excluding Li and Be) are often nonframework atoms and are therefore removed during the species sorting algorithm (and ultimately from the final coarse-grained structure). B, Si, and P are also sorted on a case-by-case basis because there are many instances in which they behave as either A or “other” sites (e.g., boron is an A site in boron imidazolate frameworks[43] but a nonframework-site atom when a part of a BF4– anion). Therefore, if the element is bound to O or N—indicative of Lewis-acidic-type behavior—it is assigned as an A site. It should be noted that such modifications to the sorting algorithm are made based on the chemical nature of the data set being studied and serve primarily to minimize the requirement for user input during the coarse-graining procedure. The modifications may therefore be amended or removed in future work, without affecting the conclusions herein.

repair_disorder()

Site disorder occurs frequently in MOFs, and it is commonly modeled crystallographically using split sites with partial occupancies.[44] To enable structure-processing tasks that require discrete atomic positions, partially occupied (or disordered) sites are simplified using three algorithms, callable with the Structure.repair_disorder() method. First, atoms of the same species found within a given distance range (0–0.7 Å by default) are replaced by an atom of the same type, situated on the average of their positions. The remaining two routines deal with delocalized electron density associated with guest species within framework pores, which is often modeled by clusters of (fictitious) oxygen atoms. These clusters are removed by identifying either oxygen atoms that only have oxygen nearest neighbors or oxygen atoms that have no neighbors at all within a set cutoff radius. Not all disorder needs to be resolved for the general coarse-graining procedure to work; the minimum requirement is that individual molecules (e.g., organic ligands) do not overlap. However, unresolved disorder may affect the placement of the discrete place-holder atoms (e.g., by skewing the centroid of a given molecule).

reduce()

Each unique building block fragment, or ligand, is then identified by a “nearest-neighbor crawl” algorithm by calling the Structure.reduce() method. The default neighboring-site routine used is CrystalNN,[45] as implemented in Pymatgen.[41] In short, CrystalNN uses Voronoi decomposition to assign weights to an initial list of neighbors within a hard cutoff radius and then normalizes and assigns probabilities to each unique weighting according to a smooth cutoff function. With this definition, a building block fragment list is initialized with a given atom from a given building block type, and if another atom is both a nearest-neighbor and of the same building block type, it is added to the list. These neighbors are then searched for their respective nearest-neighbors, and the process is iterated (thereby “crawling” round the ligand) until the list size converges. Once converged, the atomic species, positions, and connectivity (in the form of a connectivity graph) of the fragment are stored as a buildingUnit class instance and appended to the Structure.units class attribute. The algorithm repeats until all atoms have been classified.

coarse_grain()

To coarse-grain the building blocks, a discrete bonding center must first be defined. There is seldom a unique choice for this. The simplest definition is to take the geometric centroid of all of the atomic positions in the building unit (i.e., not weighted by the atomic masses). This parallels the methodology used in both crystal net determination (often referred to as “equilibrium” or “barycentric” placement)[11] and coarse-grained molecular dynamic simulations[46] and is the definition used in the present work. The buildingUnit class stores the connectivity of a given building unit as a graph object using the NetworkX package,[47] thereby enabling alternative bonding centers to be defined. These alternative definitions are not used in the present work but are explored in a Jupyter notebook that is provided with the code (see the Data and Code Availability section). The building-unit connectivity is defined as the number of nearest-neighbor atoms of a different site type which are connected to a given fragment. All fragments with connectivity >1 (thereby distinguishing nonframework species, such as solvents, from the B sites) are coarse-grained by placing a dummy atom at the chosen bonding center and removing all other atoms. Finally, all atoms of a given building-block type are assigned to the same atomic species. The processed structures can then be output in the CIF format.

Structure Analysis

Studying large, complex data sets requires generalized analysis routines. At the core of our analysis class, StructureMap, structures are compared using SOAP[28] which we apply to coarse-grained structural models (indicated by “cg-SOAP”).[39] Using dimensionality reduction algorithms, configurations can be visualized and the relationships between them then analyzed.

cg-SOAP

The SOAP kernel measures the similarity of pairs of atomic environments.[28,29,32] Formally, for each atom, α, an atomic density, ρα(r), is constructed with a sum of Gaussians of broadness σ, centered on each neighbor, β, of α (and on α itself): where fcut denotes a cutoff function. The SOAP kernel is then defined as the overlap integral of any two neighbor densities, integrated over all three-dimensional rotations R̂,[28]where the exponent is typically set to n > 1 to retain angular information.[29] In practice, it is computationally more efficient to expand the atomic density in a set of orthogonal radial basis functions and spherical harmonics up to a given nmax and lmax. The resulting combination coefficients on their own do not yet ensure rotational invariance (because all spherical harmonics with l > 0 depend on the angular orientation) and are therefore collected into a power spectrum vector. The SOAP kernel may then be calculated by taking the normalized dot product of the two power spectrum vectors associated with each atomic environment, raised to an exponent ζ which serves to accentuate the distinction between the two environments.[28] As shown in ref (48) for elemental structures and in ref (39) for a range of inorganic and hybrid materials, geometric similarity may be assessed using SOAP for uniformly rescaled structures, enabling direct comparison irrespective of characteristic A–B distances. We have implemented two scaling approaches: either scaling to a uniform minimumr(A–B) distance or scaling to a uniform averager(A–B) distance. To extend the similarity measure beyond comparing individual atomic environments, similarities between pairs of atoms in each of the crystal structures are calculated:where α (β) runs over all atomic sites, , in the unit cell of structure i (j), respectively. Variations of this method are also implemented where the atomic sites considered are restricted to a given site type (e.g., A sites), thereby shifting the focus of the similarity analysis toward those particular sites. (In this case, information regarding the other site types is still implicitly encoded through the neighbor densities.)

Dimensionality Reduction and Visualization

To interpret (cg-)SOAP analysis results, the data set is often visualized as a two-dimensional projection.[29,32,39] A large number of algorithms are available to carry out this projection (or “embedding”), and a central aspect of the present work will be to compare different widely used embedding schemes. Our implementation stores the similarity of all structures with one another in the form of a symmetric similarity matrix, K, which we construct using the per-cell averaged similarity (eq ), viz., . The corresponding geometric distance matrix, D, may also be defined with elementsto satisfy the triangle inequality.[29] We currently provide interfaces to the following dimensionality reduction algorithms implemented in external code packages: multidimensional scaling (MDS),[49] t-distributed stochastic neighbor embedding (t-SNE),[50] and the uniform manifold approximation and projection (UMAP).[51]

Bonding and Properties

During the reduction of MOF structures to their coarse-grained representations, a dictionary of bonds between building units is stored, as recently defined in the IUCr topology dictionary (topoCIF). This enables the CIF output to contain the requisite information to readily construct the underlying net,[52,53] that is, the net of building units, and to calculate its topological descriptors (e.g., using ToposPro). It also enables the calculation of local-environment properties, including bond lengths, angles, and order parameters. We use a module, referred to as bonding, to extract this geometric information and calculate Chau–Hardwick tetrahedral order parameters and Steinhardt bond order parameters.[54−56] This module might also be extended for other custom analyses. Two global structure properties are also included in the routine structure analysis. The first is A-site density, an important material descriptor when considering the potential void space present in a framework. The second property is the A-site SOAP heterogeneity introduced in previous work,[39] which measures the diversity of the A-site environments in a given structure. A value of zero means that all A-site environments are geometrically equivalent (up to the SOAP cutoff radius); a higher value indicates greater diversity. It is calculated aswhere P is the set of (unordered) pairs of distinct A-site environments in the structure, is the number of A-site atoms in the structure, and k is the SOAP similarity kernel defined in eq .

Database Details

The data set curated for this work expands upon the AB2 study reported in ref (39), now focusing on analyzing the diversity in the wider set of AB2 MOFs with two-connected ligands. Restricting the study to a single coordination formula and ligand connectivity enables careful validation of the reported coarse-graining methods and a thorough examination of the results; in particular, being able to relate trends in our configuration space to local geometric properties of individual structures serves as a useful tool for understanding what information is captured in the cg-SOAP approach. Primary data were selected from the sample that was prepared for ref (13), filtering for all structures in which the coordination formula was AB2 and the ligand (B) was two-connected. The sample contained 1160 crystal structures, to which 102 structures from the CSD 5.42 update 1 (Feb 2021) were added. A complete list of the 1262 crystal structures and their topological descriptors is provided in Supporting Information (see the file CF_A_B2_1262.xlsx). The “experimental literature” to be processed was exported as CIFs from the CSD using the 1262 entry refcodes and processed unchanged.

Results and Discussion

Visualizing Geometric Diversity

We begin by visualizing the geometric diversity in the AB2 MOF data set, here embedding cg-SOAP similarities for all (uniformly scaled) structures in two dimensions. Dimensionality reduction is an unsupervised ML task, aiming to extract information from unlabeled data sets, for example, by clustering the data into groups. We start by visualizing the data set using one of the simplest algorithms, MDS (as used in ref (39)), in Figure . The structure map carries the intuitive interpretation that structures that are similar appear close together and structures that are dissimilar are further apart. In the context of MOFs, structures are generally reported with their topological identifier which uniquely identifies the underlying net. We highlight the distribution of the three most commonly occurring topologies in the data set—the diamond-like dia net (189 structures), the sodalite-like sod net (136 structures), and the fourfold interpenetrated 4#dia net (116 structures)—to investigate the degree to which topological variation is represented in the cg-SOAP representation. Strikingly, all three topological families have structures dispersed over the map, with the dia topology being most widely distributed among the three (light blue). There are regions of overlapping points, particularly for dia and sod in the lower part of Figure , indicating that topologically dissimilar MOFs might in fact show similarities in terms of their geometric structure.

Figure 2

Geometric diversity within isoreticular groups of AB2 MOFs. The graph shows a two-dimensional visualization of the cg-SOAP-based structural distances using an MDS embedding: generally, the closer the two points are, the more similar their coarse-grained and rescaled geometric structures are. The distribution of the three most commonly occurring topologies within our data set (dia, sod, and fourfold-interpenetrated dia, denoted as 4#dia) is emphasized by lines that originate from the respective centroid. Data set entries with different topologies than the three aforementioned ones are all represented by gray points. We quantify the relative distributions of each topology in the map based on how far, on average, the structures of that group are from their respective centroid. These relative distributions are 0.771, 0.386, and 0.441 for dia, sod, and 4#dia, respectively. For ease of comparison, we normalize the relative distributions such that the value for dia is unity (Table ).

Table 1

Characteristics of the Ten Most Commonly Occurring Topologies in the Data Set

		geometric densitya	relative distribution
topology	occurrences	(mean ± SD)	MDS	t-SNE	UMAP
dia	189	125 ± 56	1 (reference)
sod	136	80 ± 26	0.50	0.45	0.60
4#dia	116	261 ± 77	0.57	0.47	0.44
5#dia	95	271 ± 87	0.67	0.47	0.50
3#dia	88	206 ± 76	0.71	0.48	0.45
2#dia	79	182 ± 82	0.91	0.76	0.55
cds	41	221 ± 87	0.81	0.54	0.44
qzd	38	156 ± 15	0.12	0.18	0.26
6#dia	36	336 ± 137	0.74	0.55	0.47
3#dmp	33	292 ± 49	0.38	0.23	0.31

The geometric density is here defined as n(A) × 1000/Vscaled, where n(A) is the number of A sites in the unit cell and Vscaled is the volume of the scaled unit cell.

The geometric density is here defined as n(A) × 1000/Vscaled, where n(A) is the number of A sites in the unit cell and Vscaled is the volume of the scaled unit cell. We see that dia networks can be formed with a much larger variety of geometries compared to sod; this is in accordance with the conclusions derived from an analysis of the natural tiles (cages).[13] This reflects a larger deformability of diamondoid structures and adaptability to building blocks with geometries spanning a wide range of volumes, lengths, and angles. This feature promotes the dominance of the dia topology in coordination networks.[13] The folding of the dia networks into interpenetrating arrays, however, significantly restricts the diversity of acceptable network geometries. Cluster analysis can assist the interpretation of complex data sets by grouping data and identifying representative examples, from which patterns can more readily be identified. Affinity propagation is a clustering algorithm that views each data point as a node in a network and recursively minimizes the edge weights between nodes; the magnitude of each edge at a given time reflects the current affinity the point has for selecting the second point as its “examplar”.[57]Figure a shows the dia structures (light blue data points in Figure ) now divided into four clusters, with the exemplar coarse-grained structures visualized. From these clusters, we investigate the distribution of structural properties in different regions of the map in order to appreciate the geometric diversity available within the dia topology. In the context of MOFs, low metal densities are the simplest indication for the presence of void space: an important feature for catalytic applications. We define the geometric density as the (unitless) density of the structures that have been coarse-grained and scaled (to unity minimum r(A–B) bond length). The geometric density is related to the experimental metal density by the characteristic framework bond length. We plot the distribution of the geometric density, average angular component of the Chau–Hardwick order parameter, Sg (a value of zero corresponds to ideal tetrahedral bond angles; a value of unity would correspond to the extreme case where all four bonds are superimposed), and the average A−B−A angle in Figure b–d, respectively.

Figure 3

Geometric diversity within MOFs of dia topology. (a) From the data set characterized in Figure , we isolate the dia entries and analyze their distribution using a clustering algorithm, viz., affinity propagation.[57] For each cluster, the algorithm selects an “examplar” datapoint, and the corresponding coarse-grained structures and their CSD refcodes are shown. Distributions of local properties for each cluster are presented on the right-hand side: (b) the geometric density, (c) the average angular component of the Chau–Hardwick order parameter, Sg, and (d) the average A−B−A angle. Throughout this paper, box plots are drawn such that boxes range from the 25th to the 75th percentile, with the median indicated by a horizontal line; whiskers span ±1.5 times the interquartile range, and points outside this range are plotted with circle markers. Figure b shows the distribution of geometric density for each cluster. There is a relatively clear separation between each group, with cluster 2 (orange) containing notably more dense frameworks. It is interesting to note that clusters 2 and 4, which have higher average geometric density, also seem to have a broader distribution across the MDS map, relative to the narrow distributions of clusters 1 and 3. This could, in part, be explained by the relatively high deviation away from ideal tetrahedral bond angles about the A sites, as evident from the plot of the angular Chau-Hardwick tetrahedral order parameter, Sg (Figure c). Figure d shows the distribution of the average A−B−A bond angle; again, clusters 2 and 4 demonstrate broad distributions with lower average values. We extend the analysis of the dia subset of structures using parameters extracted using ToposPro;[53] in particular, the tile average distortion[13] supports the results illustrated in Figure . Clusters with higher average geometric density have larger distortions in the tetrahedral coordination of the A sites, corresponding to a collapse of the tiles. Conversely, the most porous structures have A site coordination environments closer to an ideal tetrahedron and the largest proportion of tiles close to the adamantane tile of the ideal dia net. Analogously, structures with A−B−A angles less than 150° also correspond to denser structures due to the collapse of tiles. Figure emphasizes that a single topological classification may give rise to a geometrically diverse set of structures; that is to say, one cannot necessarily predict the geometric features of a structure from its topological label alone. From Figure , we therefore infer that the study of the latent geometric configuration space may provide insights into a different set of material properties (e.g., bulk modulus) compared to those which are accounted for by the topology (e.g., porosity[13]).

Embedding Schemes

The universal aim of dimensionality reduction algorithms is to capture a meaningful structure in high-dimensional data when embedded into low dimensions. However, it is imperative to consider the algorithm methodology when interpreting the structure map. To illustrate this point, we have visualized our data set with three dimensionality reduction algorithms; namely, MDS,[49] t-SNE,[50] and UMAP[51] (Figure ). Our principal aim is to demonstrate that the interpretation of our cg-SOAP representations is invariant to the specific embedding scheme chosen; indeed, there are other algorithms available (e.g., kernel principal component analysis[58] and variants thereof[59]) that are not considered in this study but could be used in future work for alternative and/or complementary interpretations of the configuration space. In order to understand how each representation differs from one another, we color-code the map by the geometric density (Figure a) and by the distributions of dia, sod, and 4#dia topologies (Figure b, cf. Figure ), and compare side-by-side how these properties are represented in the results of different embedding schemes.

Figure 4

Visualizing geometric diversity in MOFs using different embedding schemes. The plots compare the results of multidimensional scaling (MDS,[49]left), t-distributed stochastic neighbor embedding (t-SNE,[50]center), and uniform manifold approximation and projection (UMAP,[51]right). For each embedding scheme, we show two-dimensional structure maps characterizing the coarse-grained and scaled AB2 MOF data set, colored (a) by geometric density, and (b) with the distribution of the three most commonly occurring topologies in the data set, dia, sod, and 4#dia, highlighting the respective centroids as in Figure . All three embeddings demonstrate a strong correlation of the distribution of points with geometric density and qualitatively similar trends in the relative distributions of topologies within each structure map (Table ). The detail, however, varies. MDS leads to a distinct region in the center of the map where no structures are found. This perhaps suggests that this algorithm, which seeks to minimize a relatively simple loss function, is reaching its limits of usefulness for the size and complexity of the data set considered here. The “island-like” features in both the t-SNE and UMAP representations—or more specifically, the absence of islands in the MDS representation—further emphasize this point. The most prominent island (right-hand side of the t-SNE map and upper-right of the UMAP map) is a family of qzd MOFs, with the smallest relative distribution, among the 10 most commonly occurring topologies in the data set, for MDS and t-SNE (Table ). Manual inspection of the structures reveals a high frequency of similar CCDC reference codes (refcodes). The refcode system is designed to group together structures into families, such as the same compound having been crystallized and characterized under different conditions, or polymorphs of the same compound. There are two families of refcodes within the qzd MOF cluster: LIWDEB (13 occurrences) and UKUVOL (21 occurrences).[60,61] All of these structures are one of two isomers of [{Cu(succinate) (4,4′-bipyridine)}], isolated under different experimental conditions, and thus could be considered duplicates. The high geometric similarity between duplicates relative to the similarity between other structures in the data set skews the representation toward creating an isolated cluster.

Islands and Duplicates

The occurrence of “islands” of structures, separated out near the edge of the t-SNE and UMAP representations, requires a more subtle interpretation. For example, one of the islands to the left of the UMAP map corresponds to structures with zni and coi topologies (α and β polymorphs of Zn(Im)2, respectively), predominantly classified within the IMIDZB refcode family. “Duplicates” cannot necessarily be identified as those with a common refcode; however, cg-SOAP screening can help automate this procedure. To illustrate this point, we isolate the IMIDZB family of structures from the database, create a structure map using t-SNE (because this algorithm generally achieves clearer clustering of data into distinct regions, albeit at the cost of meaningful intercluster distances), and analyze the representation using affinity propagation, as shown in Figure .

Figure 5

Using cg-SOAP analysis to identify relationships between individual CSD entries with a single refcode. The example case here is given by the polymorphs of Zn(Im)2, with CSD refcodes IMIDZB, suffixed with numbers representing a running index. The geometric diversity within the IMIDZB refcode family is visualized using t-SNE embedding and a cluster analysis by affinity propagation. The structures are labeled according to topology. The zni polymorphs cluster together; it could therefore be beneficial to choose a representative structure from this cluster. The spread of the cag topology into two distinct clusters illustrates the complexity of “duplicate” detection, as discussed in the text. The IMIDZB refcode family contains different polymorphs of Zn(Im)2 with different characteristic geometries and topologies, which therefore separate in the cg-SOAP map. By analyzing the geometric diversity in this smaller configuration space, we propose an automated “duplicate” structure identification procedure. The zni topology is the densest, most stable crystalline polymorph of Zn(Im)2, and all structures of this connectivity are found in the same cluster, labeled 1. One structure (e.g., the cluster examplar selected by the affinity propagation algorithm) could be taken as representative of this particular polymorph. The distribution of cag frameworks across two clusters (2 and 3), however, is an example where frameworks with identical composition, connectivity, and space group display diverse geometries, whereas IMIDZB11 corresponds to the desolvated ZIF-4 framework under ambient conditions (298 K, 1 atm) and IMIDZB12 and IMIDZB15 are the same frameworks after decreasing temperature (80 K) and increasing pressure (0.15 GPa), respectively, exemplifying the “breathing” effect in the frameworks.[62,63] With the change in external stimuli, the frameworks become more dense: the average A−B−A angle decreases and Sg increases, corresponding to a lowered “tetrahedrality” around the Zn (A) sites. On these grounds, it may be desirable to keep one structure from each cluster, that is, IMIDZB10 (cluster 2) and IMIDZB12 or IMIDZB15 (cluster 3), in order to capture the geometric diversity fully. More generally, one could propose an algorithm that classifies duplicates by considering the refcode, topology, and cg-SOAP similarity as an automated approach that makes it possible to preserve the subtle geometric diversity that arises from varying experimental conditions. The screening of duplicates is expected to be helpful (and indeed required) for moving to very large databases in the future, as demonstrated in a recent analysis of DFT-optimized data sets of MOFs.[64]

Quantifying Geometric Diversity

Our approach also enables quantitative investigation of local structural properties and how they are distributed for different categories of structures (e.g., topology). For example, interpenetration is commonly found in MOFs, and it holds implications for the potential functionality of a given compound because it is closely related to porosity.[65] Generally, porous materials minimize the energy of the framework through optimal filling of void space, and thus, in cases where void space is of sufficient size, interpenetration may be observed. Controlling the degree of interpenetration has been explored using subtle changes in the synthetic methodology, such as varying reaction conditions, templating agents, and ligand design.[66,67] Given the increasing number of MOF crystal structures reported, the question arises as to whether we can postrationalize the extent to which the local geometry influences the tendency to interpenetrate. In Figure , we show the distributions of local properties for each structure, for different degrees of interpenetration of the diamond-like net (Z = 1 corresponds to dia MOFs, Z = 2 corresponds to 2#dia, and so on). In Figure a, we show the distribution of A-site heterogeneity values (eq ). Figure b,c illustrates the distributions of established local property descriptors; namely, Sg and the average A−B−A angle, respectively.

Figure 6

Distributions of geometric property indicators for increasing degrees of interpenetration, Z, of the dia-based MOFs in the data set. The figure shows (a) A-site heterogeneity (eq ), (b) average angular component, Sg, of the Chau–Hardwick order parameter, and (c) average A−B−A angle. We note that the majority of dia MOFs have locally homogeneous A sites, and this homogeneity does not appear to substantially depend on the degree of interpenetration (Figure a). Similarly, the distribution of the average Sg does not show a clear correlation with the degree of interpenetration (Figure b). There is, in contrast, a much stronger correlation with the A−B−A angle: as the degree of interpenetration increases, the average A−B−A angle tends toward 180° (Figure c). This makes intuitive sense when one considers that longer, “rod-like” ligands gives rise to greater void space and therefore enable a greater degree of interpenetration, which is consistent with the synthetic approach of employing longer spacer ligands to target higher degrees of interpenetration.[68−70] It may be inferred from these distributions that the A sites maintain a similar environment, irrespective of the degree of interpenetration, whereas the linker geometry plays a crucial role in determining this property. Finally, we extend this quantitative analysis to all topologies that occur at least five times in the data set, plotting the distribution of A-site heterogeneity values in the respective structures in Figure .

Figure 7

Distributions of A-site heterogeneity (eq ) for all structures in the data set for which the topology occurs ≥5 times. Three example MOFs are highlighted with coarse-grained structural representations, CSD refcodes, and the corresponding topology symbol and are discussed in the text. The qzd structures have a distinctly narrow distribution of A-site heterogeneity values, which reinforces the hypothesis that the separation from the main body of structures in the t-SNE and UMAP embeddings (Figure ) is a skewing of the visualization as a result of duplicate structures. It is also interesting to note that the zni topology has a narrow A-site heterogeneity distribution at a relatively high average value, which likely contributes to the separation of the zni structures. Figure highlights examples of topologies with particularly low (qzd) and particularly high (4T13, pts, and 2#pts) A-site heterogeneity, for which illustrative coarse-grained and scaled crystal structures are visualized. The structures with pts and 2#pts topology (which are related by increasing from a single to twofold interpenetrated net) all contain both tetrahedral and square planar geometries about different A sites, typically by combining Zn (tetrahedral geometric preference) with any of Ni, Cu, Pt, or Pd (square-planar geometric preference). Some frameworks have Cu in both the tetrahedral and square-planar sites of the pts framework. When combined with Au, Cu/Ag occupy the tetrahedral sites in the 2#pts framework. One might consider attempting to target these pts topologies, therefore, by selecting metals with the appropriate geometric preferences demanded by the framework. The high degree of heterogeneity in the 4T13 frameworks can be ascribed to a tension between the differing lengths of two organic linkers, on the one hand, and the additional flexibility afforded using two linkers, on the other hand (cf. DEBWAK in Figure ). The 4T13 frameworks have a short linker (e.g., isophthalate in DEBWAK) and a long-chain, flexible organic linker (e.g., N,N′-bis(pyridin-4-yl)-2,2′-bipyridine-5,5′-dicarboxamide in DEBWAK).[71,72] The resultant framework has 1-D helical chains that cross each other to create a 2-D molecular braid with geometrically distinct Zn sites. Automated, quantitative analyses such as those exemplified in Figures and 7 should be a helpful part of the methodology used in future work for understanding the geometric and structural diversity in databases of materials.

Conclusions

We have studied the structures of AB2 MOFs containing a diverse set of two-connected organic linkers. By coupling a cg-SOAP approach to different embedding schemes, we have visualized and analyzed the geometric diversity in a database of MOF structures. With the aid of cluster analysis, the structure maps of the AB2 MOF configuration space can be better understood. Here, we have focused on clustering within the low-dimensional embedding and demonstrated how the location of structures in the map was consistent with the grouping of structures with similar structural properties. We described the cg-SOAP and visualization methodology implemented in a Python package and validated the routines by confirming that the underlying net was correct, using ToposPro software. We anticipate that the methodology described in this work will be useful in visualizing, analyzing, and understanding the geometric diversity in larger MOF data sets, to which we will dedicate future work.

Computational Details

Our code imports functionality from Pymatgen (the version used for the present work was 2020.12.31),[41] the atomic simulation environment (ASE) (version 3.19.0),[73] and NetworkX (version 2.3)[47] for structure processing tasks. The SOAP implementation is imported from DScribe (version 0.4.0).[74] The Scikit-learn (version 0.21.3)[75] implementation of MDS and t-SNE and the stand-alone UMAP implementation (version 0.4.2) are used for dimensionality reduction. The Scikit-learn implementation of affinity propagation was used for cluster analysis. References for the algorithms themselves are given in the main text. All structures were uniformly scaled to a minimum r(A−B) bond distance of unity. We computed SOAP vectors using the polynomial basis functions implemented in DScribe, with a radial cutoff of rcut = 2.5, smoothness of σ = 0.2, and atomic neighbor density expansion of up to nmax = 10, lmax = 9 (as in ref (39)). (Note that here we do not include units for rcut and σ because we have rescaled all structures.) In order to assist the comparison of embedding schemes in Figure , the MDS coordinates were reflected in the x-axis and the UMAP coordinates were rotated 90° clockwise. In terms of technical comparisons of the different algorithms, it is worth mentioning the relative times taken for the code to execute. All three algorithms could be performed on a standard MacBook Pro (1.4 GHz Quad-Core Intel Core i5 processor; 8 GB memory). The absolute times for MDS, t-SNE, and UMAP calculations were 98, 348, and 22 s, respectively. Hence, UMAP outperforms the other two embedding schemes for this particular purpose. It should be noted that a particularly low learning rate (5) was chosen for t-SNE because this was found to better capture the structure of the data (based on visual inspection of the relative “tightness” of clustering: smeared-out clusters can often be a sign that the algorithm has ended before reaching convergence). For future work on larger data sets, UMAP might therefore be preferred over t-SNE for its faster execution time. The absolute positions of data points in the structure maps will depend slightly on the specific parameters chosen (and on numerical issues), particularly for t-SNE and UMAP; however, the global trends and interpretation of the visualizations were found to remain consistent for different choices of embedding parameters.

Data and Code Availability

The CHIC Python package described in this work is openly available online at https://github.com/tcnicholas/chic; the repository includes a tutorial (in the Jupyter notebook format) for the processing and coarse-graining of an example structure. The code is under ongoing development, and therefore, we have also deposited a copy of the specific version used to generate the figures in the present work at https://doi.org/10.5281/zenodo.5271082. The full data set of coarse-grained structures is available from the same Zenodo link.

49 in total

1. Accurate determination of crystal structures based on averaged local bond order parameters.

Authors: Wolfgang Lechner; Christoph Dellago
Journal: J Chem Phys Date: 2008-09-21 Impact factor: 3.488

2. Secondary building units, nets and bonding in the chemistry of metal-organic frameworks.

Authors: David J Tranchemontagne; José L Mendoza-Cortés; Michael O'Keeffe; Omar M Yaghi
Journal: Chem Soc Rev Date: 2009-03-31 Impact factor: 54.564

3. Atom-density representations for machine learning.

Authors: Michael J Willatt; Félix Musil; Michele Ceriotti
Journal: J Chem Phys Date: 2019-04-21 Impact factor: 3.488

4. The atomic simulation environment-a Python library for working with atoms.

Authors: Ask Hjorth Larsen; Jens Jørgen Mortensen; Jakob Blomqvist; Ivano E Castelli; Rune Christensen; Marcin Dułak; Jesper Friis; Michael N Groves; Bjørk Hammer; Cory Hargus; Eric D Hermes; Paul C Jennings; Peter Bjerre Jensen; James Kermode; John R Kitchin; Esben Leonhard Kolsbjerg; Joseph Kubal; Kristen Kaasbjerg; Steen Lysgaard; Jón Bergmann Maronsson; Tristan Maxson; Thomas Olsen; Lars Pastewka; Andrew Peterson; Carsten Rostgaard; Jakob Schiøtz; Ole Schütt; Mikkel Strange; Kristian S Thygesen; Tejs Vegge; Lasse Vilhelmsen; Michael Walter; Zhenhua Zeng; Karsten W Jacobsen
Journal: J Phys Condens Matter Date: 2017-03-21 Impact factor: 2.333

5. Modular chemistry: secondary building units as a basis for the design of highly porous and robust metal-organic carboxylate frameworks.

Authors: M Eddaoudi; D B Moler; H Li; B Chen; T M Reineke; M O'Keeffe; O M Yaghi
Journal: Acc Chem Res Date: 2001-04 Impact factor: 22.384

9. Local structure order parameters and site fingerprints for quantification of coordination environment and crystal structure similarity.

Authors: Nils E R Zimmermann; Anubhav Jain
Journal: RSC Adv Date: 2020-02-07 Impact factor: 3.361

10. The Cambridge Structural Database.

Authors: Colin R Groom; Ian J Bruno; Matthew P Lightfoot; Suzanna C Ward
Journal: Acta Crystallogr B Struct Sci Cryst Eng Mater Date: 2016-04-01