Senja Barthel1, Eugeny V Alexandrov2,3, Davide M Proserpio2,4, Berend Smit1. 1. Laboratory of Molecular Simulation, Institut des Sciences et Ingénierie Chimiques, Valais, Ecole Polytechnique Fédérale de Lausanne (EPFL), Rue de l'Industrie 17, CH-1951 Sion, Switzerland. 2. Samara Center for Theoretical Material Science (SCTMS), Samara University, Moskovskoe shosse 34, 443086 Samara, Russian Federation. 3. Samara State Technical University, Molodogvardeyskaya street 244, 443100 Samara, Russian Federation. 4. Dipartimento di Chimica, Università degli Studi di Milano, Via Golgi 19, 20133 Milano, Italy.
Abstract
We consider two metal-organic frameworks as identical if they share the same bond network respecting the atom types. An algorithm is presented that decides whether two metal-organic frameworks are the same. It is based on distinguishing structures by comparing a set of descriptors that is obtained from the bond network. We demonstrate our algorithm by analyzing the CoRe MOF database of DFT optimized structures with DDEC partial atomic charges using the program package ToposPro.
A primary concern of
materials science is the discovery of new
materials and the prediction and understanding of their properties.
With steadily increasing computer power, computational studies have
become an indispensable tool for both analysis and prediction of materials.
Large databases contain not only naturally occurring[1,2] and synthesized materials but also thousands upon thousands of structures
that are generated in silico.[3−10] These databases provide the ground for computational studies, in
particular screening studies to identify interesting materials for
different applications.[3,11−15] Less known is that these databases, as we will demonstrate
below, can contain many variations of the same structure. Clearly,
one would like to avoid spending valuable resources on studying similar
structures but, more importantly, having an unspecified number of
duplicated structures will make the statistics of any screening study
unreliable. Therefore, developing a systematic methodology to identify
whether two deposited structures are duplicates not only is an important
fundamental question but also is of practical importance. This is
in particular the case if the number of structures is so large that
manual inspection is out of the question.

To illustrate our
approach of comparing structures, we focus on
a popular class of materials called metal–organic frameworks
(MOFs).[16] These are potentially porous
3D, 2D, and 1D crystalline materials, which consist of metal nodes
connected by organic ligands.[17−19] MOFs have gained much attention
during the past decade due to their huge variety. By changing a metal
type or substituting the functional group of an organic linker, one
can in principle systematically change the properties of a known MOF.
This not only makes MOFs and related nanoporous materials such as
COFs, ZIFs, PPNs, etc. intriguing material classes for basic research
but also suggests them for many potential applications, ranging from
gas separation and storage to sensing and catalysis.[16,20−25]

For complex compounds such as MOFs, we have to be careful
how to
define two materials as being equivalent, since similarities exist
on different levels. For example, if two crystals do not have the
same space groups or similar lattice parameters, they are considered
as different materials from a strict crystallographic viewpoint and
are listed as two separate entries in most databases. However, from
a MOF point of view two structures are considered identical if they
share the same bond network, with respect to the atom types and their
embedding: i.e., if two structures can in principle be deformed into
each other without breaking or forming bonds. We do not consider
a particular MOF as a new material after, for example, rotating a
ligand. However, such a small change can change the space group and
hence can be reported as a new material in these databases. There
exist several algorithms to compare crystals, but either they are
restricted to structures with the same space group[26−28] or they evaluate
the differences between atomic positions,[29,30] which is useful to detect small differences between crystals due
to slightly different experimental conditions. However, while the
traditional crystallographic approaches are important for solid-state
chemistry, the unit cells of porous MOFs and related materials are
much larger and are filled with solvents. This often causes substantial
deviations of the crystal parameters between the activated (evacuated) material
and its representatives with guest molecules,[31] and a different method is required to compare MOFs.

Most synthesized
MOFs are deposited in the Cambridge Structural
Database (CSD).[32] These materials often
contain remaining solvent molecules, as do their structural files
in the CSD. If a material is experimentally obtained under changed
conditions, the remaining solvent molecules can differ, ligands can
be differently aligned, and the unit cells can be distorted with respect
to each other. All these versions of a material are stored independently
in the CSD, and different versions of one MOF can have different chemical
and physical behavior such as the narrow- and large-pore versions
of the highly flexible MIL-53. However, from a fundamental point of
view, one is often interested in understanding the properties of the
underlying framework: i.e., the material without solvent molecules
that are not believed to be part of the true framework. Before computational
studies are performed, structures are usually “cleaned”:
i.e., solvents are artificially removed and disorders often neglected.
This leads to duplications in the resulting databases since many materials,
in particular those on which considerable experimental efforts have
been spent, are reported in numerous variations: the CSD contains
for example more than 50 structures that all describe the famous CuBTC.[33] Clearly, if the number of duplicates is this
large, it will bias these databases.

Another postprocessing step that
can cause multiple entries is relaxation:
both experimentally known and hypothetical structures are often relaxed
to obtain well-defined and energetically most stable representations
of the materials before they are studied by simulations. Since it
is impossible to ensure that an energetic minimum is global, it is
possible that different relaxations find varying local minima that
lead to multiple entries in a database.

In this article, we
show how to systematically find topological
duplicates in these material databases. We demonstrate how to compare
frameworks of MOFs, but a small variation of the algorithm can also
consider other classes of materials such as molecular crystals by
considering the patterns of hydrogen bonds and van der Waals interactions.
Similarly, it is possible to distinguish different versions of flexible
MOFs by including van der Waals interactions in the bond network.
In a representative study, we analyze a subset of the so-called “computationally
ready” MOFs of experimentally known structures (CoRe MOF database):
namely, the database of 502 frameworks[33] (502 CoRe MOF database) that contains the structures that are
relaxed using density functional theory (DFT) and to which density
derived electrostatic and chemical (DDEC) partial atomic charges are
assigned. The files stored in the CoRe MOF database are mainly derived
from the CSD by removing solvents and sometimes adding missing hydrogens.
The results are of interest in their own right, since this database
is frequently used for screening studies. Alternative databases of
cleaned MOFs can be obtained by applying the MOF detection and the
user-adopted solvent-removal algorithms that have been made available
by the CSD.[34] The prospective generation
of databases of existing and new MOFs made the development of a tool
for removing duplicates relevant and urgent.

The issue of duplication
in databases is well-known. For example,
the authors of the CoRe MOF databases already eliminated some duplicates:
two cleaned CSD structures were considered equivalent if they share
the numbers and type of atoms and if the root-mean-square deviation
of the atomic positions of their Niggli cells is smaller than 0.1 Å.[35] While this approach is intuitive, it is neither
necessary nor sufficient to determine duplicates. Clearly, all duplicates
have the same number of atoms and atom types. However, the atomic
positions can vary largely between different representations of the
same material. Indeed, we still find many duplicates in the CoRe MOF
databases. The fundamental problem is that allowing a larger root-mean-square
deviation does not address the problem of detecting duplicates correctly.
Increasing the limit allows us to find more duplicates but also falsely
identifies more nonidentical structures as duplicates.

We present
a systematic and rigorous way of distinguishing structures
that describe different materials, by introducing a set of descriptors
that each give the same value for identical structures (invariants).
Therefore, two structures with different descriptors are necessarily
different. According to our notion of equivalence, atom types and
atom numbers as well as all properties that are derived from the graph
describing the bond network are invariants. We consider the following
invariants: atom types, ligand graph, ligand coordination mode, and
properties derived from the bond network and from several of its simplified
versions, such as the dimensionality of the net, its topological indices,
and possible interpenetration. In contrast to atomic positions, symmetries,
cell parameters, volumes, or surface areas, our methodology is independent
from distortion, which makes it very robust and reliable. Our set
of invariants does not provide a complete invariant, meaning that
there might exist different structures that cannot be distinguished
by the set of descriptors. Such an example would be a pair of structures
whose bond networks were practically indistinguishable by their topological
indices (e.g., net topology, vertex symbol, point symbol[36]), which is the case for stereoisomers. Apart from
one pair of enantiomers, we have not come across an example in the
502 CoRe MOF database where our invariants wrongly identify two structures
as identical.

All analyses have been performed using the software
package ToposPro.[37] We found that the 502
CoRe MOF database of 502
relaxed structures with DDEC partial atomic charges contains 48 structures
with duplicates, some of them being reported several times, leading
to 78 redundant entries. MOF-5 is the most often listed structure
with 17 entries.
Similarities of Reported Materials
Given the large number of deposited structures, an algorithm is
needed to automatically detect similar structures and
duplicates. The results of our representative study of the 502 CoRe
MOF database, the CSD refcodes of each structure (with all bibliographic references), all chemical data, the analyses of the nets, and a list of all duplicates are given in the Supporting
Information.

At present, we often rely on visual inspection
to determine whether
a newly reported crystal structure is similar to one of the existing
materials, which is, given the ever-increasing number of reported
MOF structures, close to impossible. Interestingly, even for a single
pair of frameworks visual inspection might not be sufficient to determine
with confidence whether they are identical or not. To illustrate this
point, we consider three structures with the simple composition [Li(isonicotinate)],
XUNGOD, XUNHAQ, and XUNGUJ, which contain only C, H, Li, N, and O
atomic species and which all form three-dimensional porous networks.
The experimental structures contain different solvent molecules (morpholine, N-methylpyrrolidinone, and dimethylformamide, respectively)
and have different space groups (P1, P21, and P21/n). They are reported in the same publication[38] and are correctly stored as different structures in the CSD. Figure 1 shows a striking
similarity, and one could easily conclude that the frameworks have
the same topology.
Figure 1
XUNGOD (left) and XUNGUJ (right) in [100] projection (top)
and
[010] projection (bottom). The two frameworks have different topologies,
as can be seen by simplifying the adjacency matrix.
Indeed, the authors assigned to all frameworks
the same topological
type sra (with Li2O2 dimers as
4-c node), without naming it. (We use the RCSR three letter names[39] for net topologies, when available, or else
ToposPro TTD names.[40]) However, the ligands
of XUNGOD have a connectivity to the metals different from that of
the ligands of XUNHAQ and XUNGUJ. All three frameworks contain infinite
rod-shaped structural units aligned parallel to each other, which
is clearly seen after removing dangling atoms (1-c vertices) and suppressing
2-coordinated atoms (2-c vertices) (Figure 2).
Figure 2
Underlying nets after simplification of 1-c
and 2-c vertices, grown
up to the fifth coordination sphere around the O atom for XUNGOD (left)
(CS: 3,7,14,26,40) and XUNGUJ (right) (CS: 3,7,14,26,42). The central
O atom is marked in red, yellow balls are vertices belonging to the
second to fourth coordination spheres, and green balls denote vertices
of the fifth coordination sphere.
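The coordination sequence (CS) quoted in the caption counts, for a chosen vertex, how many vertices are reached for the first time after 1, 2, 3, ... bonds. The following is a minimal sketch of such a count for a periodic net, not the ToposPro implementation. It assumes that the net is given as a list of bonds `(u, v, offset)`, where `offset` is the unit-cell translation of `v` relative to `u`; this input format and the function name `coordination_sequence` are illustrative choices. The primitive cubic net pcu (one vertex bonded to its translates along the three axes) serves as a small check.

```python
def coordination_sequence(bonds, start, depth):
    """Count the vertices in successive coordination spheres of a periodic net.

    bonds: list of (u, v, (dx, dy, dz)); vertex u in a cell is bonded to
           vertex v in the cell shifted by (dx, dy, dz). Illustrative format.
    """
    # Store every bond in both directions; the reverse bond has the negated offset.
    neighbors = {}
    for u, v, offset in bonds:
        neighbors.setdefault(u, []).append((v, offset))
        neighbors.setdefault(v, []).append((u, tuple(-c for c in offset)))

    origin = (start, (0, 0, 0))
    seen = {origin}
    frontier = [origin]
    shells = []
    # Breadth-first growth: each pass collects the images first reached in that shell.
    for _ in range(depth):
        next_frontier = []
        for vertex, cell in frontier:
            for other, offset in neighbors.get(vertex, []):
                image = (other, tuple(c + o for c, o in zip(cell, offset)))
                if image not in seen:
                    seen.add(image)
                    next_frontier.append(image)
        shells.append(len(next_frontier))
        frontier = next_frontier
    return shells


# pcu: a single vertex bonded to its translates along x, y, and z.
pcu = [("X", "X", (1, 0, 0)), ("X", "X", (0, 1, 0)), ("X", "X", (0, 0, 1))]
print(coordination_sequence(pcu, "X", 3))  # [6, 18, 38]
```

Two nets whose sequences differ for corresponding vertices, as in the fifth sphere of XUNGOD and XUNGUJ, cannot have the same topology; equal sequences alone do not prove that two nets are identical.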
The analysis of the coordination sequence (CS) of atoms in
the
simplified net shows that XUNHAQ and XUNGUJ have the same CS for all
atoms and share the net topology, while the CS of XUNGOD is different.
For example, the CS of the O atom differs for the fifth coordination
sphere (Figure 2).
The frameworks of XUNHAQ and XUNGUJ are duplicates but are consequently
different from XUNGOD. These subtleties cannot be found by visual
inspection but are only detectable by a more sensitive graph analysis
using the simplified adjacency matrix and topological indices such
as the coordination sequences (CS).

An example of two structures
that have identical frameworks is
the pair AMILUE and AMIMEP,[41] two versions
of [Zn4(urotropin)2(2,6-naphthalenedicarboxylato)4]. They arise from a study of different framework–host
interactions: AMIMEP contains guest ferrocene molecules that are not
present in AMILUE. However, the frameworks (Figure 3) are too complicated to be reliably identified
as identical by visual inspection, which is additionally hindered
by the difference in the cell parameters and a shift of the unit cells.
Figure 3
AMILUE
(left) and AMIMEP (right) in [001] (top) and [100] (middle,
bottom) projection. The cleaned frameworks are identical.
Two identical frameworks of [Zn3(bpdc)3bpy]
(bpdc2– = biphenyldicarboxylate dianion, bpy = 4,4′-bipyridine),
which were originally reported as two different structures, are HEGJUZ[42] and XUVHEB.[43] The
two publications do not refer to each other. This is not surprising,
since HEGJUZ has space group P21/n and some disorder on the solvated dimethylformamide, while
XUVHEB has space group Pbcn and no disorder on the
solvate molecules but instead contains two additional uncoordinated
waters (Figure 4).
Figure 4
Frameworks
of HEGJUZ (left) and XUVHEB (right) in [010] projection.
HEGJUZ and XUVHEB only differ in water clathrates and a disorder of
HEGJUZ. The cleaned frameworks are identical.
Finally, we illustrate that an analysis of the net topology
alone
is also not sufficient in general to distinguish frameworks, since
frameworks with different composition can share their net topologies.
Clearly, substituting one atom type with another will change the structure
but not the net. An example is IBICED[44] (or its analogue IBIDAA[44]), which differs
from IBICAZ[44] only by the type of halogen
atom in the [Zn(Hal)(mpmab)] framework (Figure 5). A more complex reason for two different
structures to share the same net can be that they are formed from
enantiomeric ligands. An example is IBICON, which is the enantiomeric
isomer of IBICED and IBIDAA. While IBICED and IBIDAA are constructed
with the chiral L ligand and belong to the chiral space group P61, IBICON has space group P65 using the D ligand. Comparing the space groups of chiral
structures (e.g., P61 and P65) will tell enantiomeric pairs apart, but this is a
difficult task for frameworks taken from the CoRe MOF databases, since
all relaxed structures are stored in the space group P1 and the original information on the space group is lost.
Figure 5
Identical frameworks
of IBICED (top left) and IBIDAA (bottom left).
IBICON (top right) is their mirror image. IBICAZ (bottom right) is
only distinguished from IBICED and IBIDAA by the atom types: Br (orange
balls) is substituted by Cl (green balls).
Methods
To automatically search
for duplicates, we first compare the atom
types of networks and the composition and the graph of linkers and
subsequently analyze topological properties of the bond network and
its simplifications as described below. This analysis is very robust
in distinguishing networks of different topologies as well as in detecting
skeleton isomers. In principle, it is also possible to find stereoisomers
(enantiomers, cis/trans isomers, conformers) using information about
crystal symmetry and geometrical fingerprints.[45,46]

The bond network of a structure is the graph
whose
vertices correspond to the atoms and whose edges correspond to interatomic
bonds. A network, net, or graph is a particular combinatorial structure that consists
of vertices and edges attached to the vertices. The degree
of a vertex is the number of end points of edges connected
to it. The degree of a vertex corresponds to the coordination of an
atom. The bond network is equivalent to the adjacency matrix of a structure: i.e., the matrix that lists all atoms and the bonds
between them. An underlying net of a structure is
a simplified version of the bond network. It is constructed by adding
a vertex for each structural group and connecting a pair of vertices
with an edge if the corresponding structural groups have a bond between
them.[40,47] We perform three different simplifications
on the bond network, which we further analyze. They are illustrated
for MOF-5 in Figure 6a.
Figure 6
Simplifications of MOF-5 (SAHYOG[48]):
(a) original MOF-5; (b) simplified adjacency matrix, net topology mof; (c) standard simplification, net topology fff; (d) clusters of the cluster simplification; (e) cluster simplification,
net topology pcu; (f) 2-fold interpenetrated version
of MOF-5 (HIFTOG[49]).
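The simplified adjacency matrix of panel (b) results from pruning the bond network: isolated and dangling atoms are deleted and 2-coordinated atoms are suppressed until every remaining vertex has degree at least 3. The following is a minimal sketch of this pruning for a finite graph given as a list of bonds. It is not the ToposPro implementation; a periodic net would additionally need unit-cell offsets on the bonds, and the function name `simplify_adjacency` and the toy graph are illustrative assumptions.

```python
def simplify_adjacency(bonds):
    """Delete 0-c and 1-c vertices and suppress 2-c vertices of a finite graph.

    bonds: list of (u, v) pairs. Returns a dict vertex -> list of neighbors;
    repeated entries represent multiple bonds created by contraction.
    """
    adjacency = {}
    for u, v in bonds:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)

    changed = True
    while changed:
        changed = False
        for vertex in list(adjacency):
            nbrs = adjacency[vertex]
            if len(nbrs) <= 1:
                # Isolated or dangling atom: remove it together with its bond.
                for n in nbrs:
                    adjacency[n].remove(vertex)
                del adjacency[vertex]
                changed = True
            elif len(nbrs) == 2 and vertex not in nbrs:
                # 2-c atom: contract it by bonding its two neighbors directly.
                a, b = nbrs
                adjacency[a].remove(vertex)
                adjacency[b].remove(vertex)
                adjacency[a].append(b)
                adjacency[b].append(a)
                del adjacency[vertex]
                changed = True
    return adjacency


# Toy graph: a tetrahedron A, B, C, D in which the bond A-B is replaced by
# the chain A-x-y-B, with a dangling atom H attached to x.
toy = [("A", "x"), ("x", "y"), ("y", "B"), ("x", "H"),
       ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("C", "D")]
simplified = simplify_adjacency(toy)
print(sorted(simplified))  # ['A', 'B', 'C', 'D']
```

In the toy graph, H is deleted first, after which x and y become 2-coordinated and are contracted, leaving the tetrahedron with four 3-c vertices.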
The simplified adjacency matrix is obtained by deleting isolated
and dangling atoms and suppressing
atoms that have only two bonds. Every vertex of the underlying net
with degree 1 is removed together with its adjacent edge, and edges
with an end point of degree 2 are contracted iteratively until the
minimal degree of the graph is 3 (the resulting graph is independent
of the order in which the deletions and edge contractions are performed)
(Figure 6b).

The standard simplification considers metal atoms and organic ligands of a MOF as its structural
units and substitutes the atoms of each ligand by one dummy atom,
usually placed at the center of mass. In more general terms, anything
that is not a metal is contracted to its center of mass. That applies
not only to organic ligands but also to single nonmetal atoms, such
as oxygen, halogen, or multiatomic noncoordinated species (anion,
cation, solvent) (Figure 6c).

The motivation
of the cluster
simplification is to recognize clusters of atoms by decomposing
the structure into pieces with high connectivity. For each bond, the
smallest ring of bonds is found that contains the bond. The ring sizes
are sorted by increasing values into the sequence $a_1 \le a_2 \le \dots \le a_N$,
where $N$ is the number of bonds in the structure. If the sequence
contains a pair $a_i$, $a_{i+1}$ such that $a_{i+1} - a_i > 2$, bonds whose smallest rings are formed
with fewer than $i + 1$ bonds are considered to belong to
a cluster while the other bonds connect two clusters (Figure 6d). The cluster simplification
for $i$ is obtained by substituting each cluster with
a dummy atom and keeping the bonds between clusters (Figure 6e). If there exist several
gaps in the sequence $a_1, \dots, a_N$, the structure permits several different cluster simplifications
and one cluster simplification is obtained for each index. Note that
identical structures have the same sets of cluster simplifications.

An unlimited number of simplifications can be performed on
top
of each other, and it clearly matters in which order the simplifications
are performed. However, only finitely many nonidentical simplified
nets of a given structure can be obtained, as at some point it is
impossible to further simplify a net. To facilitate the analysis of
the network topology, we perform an adjacency matrix simplification
on top of both the standard simplification and the cluster simplification.
While the topology of the net obtained from simplifying the adjacency
matrix is often too specific to match one of the common three-letter
topologies, the net obtained by the cluster simplification is the
most simplified one and usually carries the topology that is commonly
assigned to a structure. For example, the topology of the net obtained
by simplifying the adjacency matrix of MOF-5 got its own name mof only because it is such a famous structure. However, one
would usually consider MOF-5 to be of primitive cubic topology pcu, which indeed is the topology of the net obtained by performing
a cluster simplification and subsequently simplifying the adjacency
matrix. Simplifying the adjacency matrix of the standard simplified
MOF-5 yields a net with topological type fff.

The
standard and cluster descriptions coincide in many cases (239
of 488 structures in the 502 CoRe MOF database: 49%): namely, if
the structure building unit is a single metal atom and the ligand
is not branched, which prevents the underlying net from splitting
into several vertices with degree greater than 2. For example, both
simplified versions of [Cd(isonicotinate)2] AVAQIX[50] have dia (diamond) topology.

The topological type of a net is an invariant, as are (extended)
point and vertex symbols. These are weaker notions than the net topology,[36] but the combination of the extended point symbol
and the vertex symbol is in practice able to distinguish different topologies.
If a topology is not identified because it is not contained in the
ToposPro database of topologies, the point symbol and vertex symbol
can still be used to compare two structures. However, two nets with
the same net topology might have different structural building units.
For example, KAYBIX and KAYBUJ[51] have the
same composition C7CaH3NO4, and their
standard simplified nets both have 5,5T7 topology. However, they are
not duplicates since their ligands are isomers: pyridine-2,5-dicarboxylate
anion and pyridine-2,4-dicarboxylate anion, respectively. Such a difference
can be detected by comparing the graphs of ligands, which were analyzed
by computing the coordination modes of ligands and metals following
the approach of Serezhkin et al.[52] A difference
in one of the obtained graph descriptors, namely the coordination
mode of the ligand (in brackets) and an identifier for their composition
(in braces), is sufficient to conclude that two structures are chemically
different. Examples of coordination isomers are given in Table 2 and illustrated in Figures SI1 and SI2.
Table 2
Compounds with the Same Stoichiometric Compositions and Ligands but Different Modes of Ligand Coordination

compound                                     | refcode           | ligand
[Cd3(μ6-biphenyl-3,4′,5-tricarboxylato)2]    | HEKTUO            | C15H7O6[G42]{196}
                                             | QEKLID, QEKLID01  | C15H7O6[G51]{196}
[Y(benzene-1,3,5-tricarboxylato)]            | SEHTEF            | C9H3O6[G22]{158}
                                             | LAVSUY            | C9H3O6[G42]{158}
                                             | NADZEZ            | C9H3O6[G6]{158}
[Y2(terephthalate)3]                         | LAGNOY            | C8H4O4[K22]{78}
                                             |                   | C8H4O4[K4]{78}
                                             | LAGNUE            | C8H4O4[K4]{78}
[Y2(pyridine-3,5-dicarboxylato)3]            | SERJUV            | C7H3NO4[K22]{290}
                                             |                   | C7H3NO4[K31]{290}
                                             |                   | C7H3NO4[K4]{290}
                                             | SERKEG            | C7H3NO4[K22]{290}
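The ligand symbols in Table 2 consist of the ligand composition, the coordination mode in brackets, and a composition identifier in braces. As a minimal sketch of how such symbols could be compared automatically, the following code parses them and tests whether two structures carry the same set of ligand descriptors. The regular expression and the function names are illustrative assumptions based only on the symbols listed in the table; they are not part of ToposPro.

```python
import re

# Pattern inferred from the symbols in Table 2, e.g. "C15H7O6[G42]{196}".
LIGAND_SYMBOL = re.compile(
    r"^(?P<formula>[A-Za-z0-9]+)\[(?P<mode>[A-Z]\d+)\]\{(?P<ident>\d+)\}$"
)


def parse_ligand_symbol(symbol):
    """Split a ligand symbol into (formula, coordination mode, identifier)."""
    match = LIGAND_SYMBOL.match(symbol)
    if match is None:
        raise ValueError(f"unrecognized ligand symbol: {symbol!r}")
    return match.group("formula"), match.group("mode"), int(match.group("ident"))


def same_ligand_descriptors(symbols_a, symbols_b):
    """Compare two structures by their sets of parsed ligand symbols.

    Sets are used so that the result does not depend on how many copies of
    a ligand the stored cell contains.
    """
    return set(map(parse_ligand_symbol, symbols_a)) == set(map(parse_ligand_symbol, symbols_b))


hektuo = ["C15H7O6[G42]{196}"]
qeklid = ["C15H7O6[G51]{196}"]
print(same_ligand_descriptors(hektuo, qeklid))  # False: G42 versus G51
```

A `False` result is sufficient to conclude that two structures are chemically different; a `True` result only means that the comparison has to continue with the remaining descriptors.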
The topological type
of a framework contains no information on
interpenetration, but ToposPro is able to determine the degree of
interpenetration. We add this check to our analysis and distinguish
differently interpenetrated versions of a structure. For example,
HIFTOG[49] is a 2-fold interpenetrated version
of MOF-5 (Figure 6f).
It is also possible to detect rare cases of entanglement isomers by
using the extended ring net.[53,54]

To analyze the
502 CoRe MOF database, we performed the steps given
below. They turned out to give a test that is not only sufficient
but also necessary to distinguish MOFs up to enantiomers.

We
did not compare the exact number of atoms, since the CoRe MOF
database contains structures given in multiples of the unit cell (e.g., Figure 7a,b), but the ratios
between elements and between central atoms and ligands were determined.
At each step, uniquely determined structures were filtered out and
sets of indistinguishable structures compared during the following
steps.
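Comparing ratios instead of absolute atom counts can be sketched as follows: the element counts of a stored cell are divided by their greatest common divisor, so that cells containing different multiples of the same unit yield the same reduced composition. This is a minimal illustration rather than the procedure implemented in ToposPro; the function name `reduced_composition` and the input format (a dictionary of positive integer counts) are assumptions.

```python
from functools import reduce
from math import gcd


def reduced_composition(counts):
    """Return the element ratios of a cell as a sorted tuple of (element, count).

    counts: dict mapping element symbols to positive integer atom counts.
    """
    divisor = reduce(gcd, counts.values())
    return tuple(sorted((element, n // divisor) for element, n in counts.items()))


# The composition C7CaH3NO4 given once and in a doubled cell.
single = {"C": 7, "Ca": 1, "H": 3, "N": 1, "O": 4}
double = {"C": 14, "Ca": 2, "H": 6, "N": 2, "O": 8}
print(reduced_composition(single) == reduced_composition(double))  # True
```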
Figure 7
(a) SAKRED and SEFBOV,
(b) KAXQOR and ZERQOE, and (c) GOMRAC and
GOMREG are duplicates in the 502 CoRe MOF database since their physical
structures differ only by some disorder.
1. composition (atom types and stoichiometry), i.e. empirical formula
2. central atom type: ligand graph, composition, and coordination
3. topological type of the net obtained by standard simplification
4. topological type of the net obtained by simplifying the adjacency matrix
5. topological type of the net obtained by cluster simplification
6. degree of interpenetration

Clearly, the order of the steps can be interchanged. In particular,
the cost of computing the net type of a more complicated net competes
with the cost of highly simplifying a net. Therefore, interchanging
steps 3 and 5 will require more effort to compute the simplifications
but less effort to compute the net topologies.
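The stepwise filtering described above can be sketched as a successive refinement of groups: at each step, the structures in a group are split according to one descriptor, structures that end up alone are set aside as unique, and the remaining groups are passed on to the next step. The code below is a minimal illustration under the assumption that every descriptor has already been computed and stored per structure; the record keys (`formula`, `net`, `interpenetration`), the structure names, and their values are hypothetical placeholders, not data from the 502 CoRe MOF database.

```python
from collections import defaultdict


def staged_duplicate_search(structures, descriptor_steps):
    """Refine groups of indistinguishable structures step by step.

    structures: dict name -> record holding precomputed descriptors.
    descriptor_steps: list of functions record -> hashable descriptor value.
    Returns (unique names, groups that no step could separate).
    """
    groups = [sorted(structures)]
    unique = []
    for describe in descriptor_steps:
        remaining = []
        for group in groups:
            buckets = defaultdict(list)
            for name in group:
                buckets[describe(structures[name])].append(name)
            for members in buckets.values():
                if len(members) == 1:
                    unique.append(members[0])   # uniquely determined: filter out
                else:
                    remaining.append(members)   # still indistinguishable
        groups = remaining
    return sorted(unique), groups


# Hypothetical records with placeholder descriptor values.
records = {
    "s1": {"formula": "X", "net": "pcu", "interpenetration": 1},
    "s2": {"formula": "X", "net": "pcu", "interpenetration": 1},
    "s3": {"formula": "X", "net": "pcu", "interpenetration": 2},
    "s4": {"formula": "Y", "net": "dia", "interpenetration": 1},
}
steps = [lambda r: r["formula"], lambda r: r["net"], lambda r: r["interpenetration"]]
print(staged_duplicate_search(records, steps))  # (['s3', 's4'], [['s1', 's2']])
```

Because every step only splits groups further, the final groups do not depend on the order of the steps; the order affects only how much work each step has to do.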
Results and Discussion
We investigated the 502 CoRe MOF database with 502 DFT relaxed
structures with assigned DDEC partial atomic charges as an example.
Of these, 488 were considered to be reliable for comparative analysis.
While searching for duplicates, we performed some simple tests on
the integrity of the 502 CoRe MOF database, such as searching for
interatomic contacts that are too short and for wrongly coordinated atoms. This
flagged 66 entries with potential problems. Before we performed our
analysis, we replaced in 46 structures erroneous atom coordinates
by their positions before relaxation to maintain the net. We furthermore
detected errors in 14 structures that were mainly caused by the removal
of solvents that are structural building blocks or attached to the
structure and chemically important or by the removal of charged anions
without balancing the charges. In these cases, it is not surprising
that the DFT optimization dramatically changes the network by breaking
and rejoining valence bonds. We excluded 14 structures, for which
hydrogens (CISMAT01, CUNXIS, CUNXIS10, GIHBII, XUWVEG), anions (AVEMOE,
BICDAU, SENWAL, SENWIT, SENWOZ), or cations (VAHSIH, MODNIC) were
missing or excess atoms were present (IJIROY, YIWMIA). Among the removed
structures is AVEMOE,[55] from which a bridging
coordinated sulfate anion was removed together with a terminal water
ligand. As a result, the removed charge is not balanced and the DFT
relaxed structure has not only a very different cell but even uncoordinated
Ag atoms and the underlying net consequently differs from the original
one. The atomic charges are also incorrect for BICDAU,[56] where terminal acetate ligands were excluded
from the structure and thus could not be taken into account in the
DFT calculations. Details and the list of problematic structures are given in the Supporting Information.
Duplicates
As can be expected from the generation of
the CoRe MOF database, most of its duplicates originate from structures
in the CSD that differ only by their clathrate solvents. The CSD refcode of each structure, all chemical data, the results from the analyses of the nets, and a list of all duplicates are contained in the Supporting Information. In addition, a list of structures that should be removed to obtain a duplicate-free version of the database is given in the Supporting
Information. Here, whenever one representative was correct and the
other erroneous, the correct one is kept, and if all representatives
were correct but were reported with different multiples of the unit
cell, the representative with larger cell volume is removed. Examples
are discussed below.

We followed the procedure outlined in Methods. In the first step, we
found 325 materials uniquely determined by their composition and detected
a further 163 structures distributed among 59 unique empirical formulas.
We then examined the structures with the same empirical formula separately
by comparing them in the next step. The second step found a further 28 uniquely determined structures from the 163,
and the resulting 135 structures with duplicate ligand sets were distributed
among 47 representatives. Among the 28 unique compounds are six pairs
of structures with isomeric ligands (see Table 1 and Methods for
an explanation).

In the same set of 28 structures, 10 coordination isomers are found,
which differ in the coordination mode of the ligand (in brackets)
in complexes of the same composition (identified by the same number
in braces) (see Table 2).

For example, there are two types of [Cd3(μ6-biphenyl-3,4′,5-tricarboxylato)2] complexes,
in which the hexadentate ligand is either coordinated in G42 mode (HEKTUO[57]) or
in G51 mode (QEKLID,[58] QEKLID01;[59] see Figure SI1). The difference in the coordination mode also
leads to different underlying topologies of the standard simplified
nets, 4,4,6T38 and 4,4,6T24, respectively. The original structures
(in the CSD) differ in addition by terminal ligands, namely dimethylacetamide
(HEKTUO) and dimethylformamide (QEKLID, QEKLID01), and water solvates
contained in QEKLID and QEKLID01 but not in HEKTUO. Other examples
are the three different clathrate structures of [Y(benzene-1,3,5-tricarboxylato)]
that are distinguished by the coordination mode of the tricarboxylate:
G[22] (SEHTEF;[60] dimethylformamide and dimethyl sulfoxide), G[42] (LAVSUY;[60] dimethylformamide),
and G[6] (NADZEZ;[61] dimethylformamide and water) (see Figure SI2). The topologies of the underlying nets obtained by standard simplification
are also different: namely 4-c sra, 6,6T2, and 6-c htp, respectively. One more striking example is the pair SERJUV[62] and SERKEG,[62] which
can be distinguished by the coordination modes of their ligands as
well as by the topologies of their simplified adjacency matrices,
while the topological type of their nets obtained from standard simplification
is stp for both.

Examination of the remaining 135
structures in step 3, i.e., comparison of their nets obtained
from standard simplification,
identifies an additional 7 structures as unique (see Table ). The 128 structures so obtained
with potential duplicates occur in 45 unique combinations of composition,
ligand symbol, and topology of the net obtained from standard simplification.
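
The stepwise separation described above can be sketched as repeated grouping by one more descriptor at a time. This is only a minimal illustration, not the ToposPro implementation: the function name `refine` and the record fields (`refcode`, `formula`, `mode`, `net`) are hypothetical, while the three entries and their descriptor values are the [Cd3(μ6-biphenyl-3,4′,5-tricarboxylato)2] example quoted in the text.

```python
from collections import defaultdict

def refine(groups, descriptor):
    """Split every group of candidate duplicates by one additional descriptor.

    Entries that end up alone in a bucket are unique; buckets with two or
    more entries remain candidate duplicates for the next step.
    """
    unique, remaining = [], []
    for group in groups:
        buckets = defaultdict(list)
        for entry in group:
            buckets[descriptor(entry)].append(entry)
        for bucket in buckets.values():
            if len(bucket) == 1:
                unique.append(bucket[0])
            else:
                remaining.append(bucket)
    return unique, remaining

# Descriptor values taken from the example in the text (hypothetical field names).
FORMULA = "[Cd3(μ6-biphenyl-3,4′,5-tricarboxylato)2]"
entries = [
    {"refcode": "HEKTUO", "formula": FORMULA, "mode": "G[42]", "net": "4,4,6T38"},
    {"refcode": "QEKLID", "formula": FORMULA, "mode": "G[51]", "net": "4,4,6T24"},
    {"refcode": "QEKLID01", "formula": FORMULA, "mode": "G[51]", "net": "4,4,6T24"},
]

# Step 1: same composition, so nothing is separated yet.
unique_1, groups_1 = refine([entries], lambda e: e["formula"])
# Step 2: the coordination mode separates HEKTUO from the QEKLID pair.
unique_2, groups_2 = refine(groups_1, lambda e: e["mode"])
# Step 3: the net of the standard simplification does not split the pair further.
unique_3, groups_3 = refine(groups_2, lambda e: e["net"])
```

Because each step only ever splits existing groups, the final partition does not depend on the order in which the descriptors are applied; only the step at which a given structure is separated changes.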
Table 3
Skeleton Isomers Revealed at the Third Step of the Analysis (Comparing the Topologies of the Nets Obtained from Standard Simplification)a
a The seven unique structures are underlined.

Comparing the
topological types of the nets obtained from the matrix
simplification in step 4 detects two more unique
structures (NUTQEZ and XUNGOD), and one quartet of [Ca(4,4′-sulfonyldibenzoato)]
structures (ZERQOE,[63] KAXQOR,[64] KAXQIL,[64] KAXQOR01[65]), originally containing different clathrates,
is split into two pairs of isomers (ZERQOE-KAXQOR, KAXQIL-KAXQOR01).
KAXQOR and KAXQOR01 are the only examples of real polymorphs in the
502 CoRe MOF database. Therefore, the number of possible duplicated
structures reduces to 126 and the number of unique representatives
increases to 46.

Comparing the topological types of the nets
obtained from cluster
simplification in step 5 does not distinguish any
additional structures, which can be explained by this commonly used
representation being the simplest notion of underlying nets.[47] Even on analyzing the large CoRE MOF database
with more than 4700 structures, we did not find any structures that
step 5 distinguishes but which were not already differentiated by
the previous steps. However, we include the cluster representation
in our analysis because it captures the net topology that is usually used
to classify and describe the topology of a MOF. Furthermore, the order
in which the steps of the algorithm are performed is a matter of
choice, as described in Methods.

Counting
the degree of interpenetration in step 6 allows differentiation
of four isomers. The [Zn2(2,2′-bithiophene-5,5′-dicarboxylato)2(4,4′-bipyridyl)] framework is listed twice as 2-fold
interpenetrated (GUYLOC, GUYMAP), and its 3-fold interpenetrated analogue
is also listed twice (GUYLUI, GUYLUI01). The well-known MOF-5
framework of the composition [Zn4O(benzene-1,4-dicarboxylato)3] is found twice in its 2-fold interpenetrated version (HIFTOG,
HIFTOG02) and is listed 17 times as a single framework. Consequently,
the number of possibly distinct structures in the previous list of
126 structures is now 46 + 2 = 48. The remaining 126 structures cannot
be distinguished further by our set of invariants. Indeed, all of the indistinguishable
structures have multiple entries: we find 39 pairs, 5 triples, 2 quadruples,
one structure that is deposited 8 times, and MOF-5 with 17 entries.
Most duplicates are caused by the removal of different clathrates/solvent
molecules from the original structure. For example, KAXQOR[64] and ZERQOE[63] (Figure b) only differ in
the CSD by the CO2 adsorbed in ZERQOE. Similarly, WOWGEU[66] and GUXLIU[67] are
independently listed in the CSD only because they contain different
numbers of clathrate water molecules in the pores of the framework
[Al2F2(ethylenediphosphonato)]. Two structures,
JAVNIE[68] and FUSWIA,[69] differ by water coordinated to the copper atoms of the
framework [Cu3Cl2(5-(4-pyridyl)tetrazolato)4], which is present in JAVNIE but absent in FUSWIA, as well
as by the clathrate molecules dimethylformamide and methanol in FUSWIA
and dimethylformamide and water in JAVNIE. More examples of duplicates
caused by different solvent molecules are the pairs AMIMEP[41] and AMILUE,[41] and
HEGJUZ[42] and XUVHEB,[43] which are discussed in Methods (Figures and 4), SAKRED[70] and SEFBOV,[71] and KAXQOR[64] and
ZERQOE[63] (Figure a,b). However, the pair GOMRAC[72] and GOMREG[72] of AlPO4 is a duplicate due to neglect of the metal disorder (Figure c). In both materials,
a third of the aluminum sites are substituted, but while GOMRAC contains
zinc, GOMREG contains manganese. Although the two materials are different,
they are both stored with full occupation of aluminum in the 502 CoRe
MOF database and must therefore be counted as duplicates.

In Chart we summarize
the structures that are distinguished during the six steps of our
algorithm applied to the 502 CoRe MOF database, leaving 126 entries with duplicates
in 48 groups (unique structures: 325 + 28 + 7 + 2 + 48 = 410): 16% (78)
of the 488 structures are redundant.
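
The bookkeeping of the six steps can be checked with a few lines of arithmetic. The sketch below uses only the counts reported in the text; the variable names are chosen here for illustration.

```python
# Structures separated as unique in steps 1-4 (counts from the text).
unique_by_step = {"composition": 325, "ligands": 28,
                  "standard net": 7, "matrix net": 2}

# Groups of indistinguishable entries after step 6: {entries per group: number of groups}.
group_sizes = {2: 39, 3: 5, 4: 2, 8: 1, 17: 1}

entries_in_groups = sum(size * count for size, count in group_sizes.items())
number_of_groups = sum(group_sizes.values())

analyzed = sum(unique_by_step.values()) + entries_in_groups
distinct = sum(unique_by_step.values()) + number_of_groups
redundant = analyzed - distinct
redundant_percent = 100 * redundant / analyzed

print(entries_in_groups, number_of_groups)   # 126 48
print(analyzed, distinct, redundant)         # 488 410 78
print(round(redundant_percent))              # 16
```

Each group of indistinguishable entries contributes exactly one distinct structure, which is why the 48 groups are added to the structures separated in the earlier steps.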
Chart 1
Detecting Duplicates in the 502 CoRe MOF Database

Statistical Errors Caused by Multiple Entries in a Database
We close with an example of why a database should be cleaned
of duplicates before statistical conclusions are drawn. The following
examination of interpenetration shows how multiple entries can mislead
a statistical analysis: if we consider all
502 structures of the 502 CoRe MOF database, we find 58 2-fold, 16
3-fold, 9 4-fold, 3 5-fold, 1 6-fold, 3 7-fold, and 2 8-fold interpenetrated
structures. However, if duplicates are removed, we find that there
is only one 7-fold interpenetrated structure of [Zn(4-(2-(pyridin-4-yl)vinyl)benzoato)2]: namely, UVARIT = UVAROZ = UVASAM[73] (dia). Similarly, the 3-fold interpenetrated structures
contain the duplicate pair GUYLUI[74] and GUYLUI01,[75] and the 2-fold interpenetrated structures contain
7 duplicate pairs. The numbers of interpenetrated structures should instead
read as 51 2-fold, 15 3-fold, 9 4-fold, 3 5-fold, 1 6-fold, 1 7-fold,
and 2 8-fold interpenetrated structures (see Chart ). The degree of interpenetration is given
in the file dealing with duplicates in
the Supporting Information.
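
The corrected interpenetration statistics follow from the raw counts by subtracting the surplus entries of each duplicate group. A minimal sketch with the numbers quoted above (the dictionary names are illustrative):

```python
# Raw counts over all 502 entries: {degree of interpenetration: number of entries}.
raw = {2: 58, 3: 16, 4: 9, 5: 3, 6: 1, 7: 3, 8: 2}

# Surplus entries per degree: 7 pairs (2-fold), 1 pair (3-fold),
# and 1 triple (7-fold), which contributes 2 surplus entries.
surplus = {2: 7, 3: 1, 7: 2}

corrected = {degree: count - surplus.get(degree, 0)
             for degree, count in raw.items()}

print(corrected)
# {2: 51, 3: 15, 4: 9, 5: 3, 6: 1, 7: 1, 8: 2}
```

The 7-fold class shows the strongest distortion: three database entries correspond to a single structure.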
Chart 2
Statistics of the Interpenetration

Conclusions
We have presented a rigorous method to distinguish MOFs that is
based on an analysis of the bond network. In contrast to approaches
that rely on comparing atom numbers and cell parameters or properties
such as atom positions, pore volume, and surface area, we are able
to reliably distinguish structures, and thus detect duplicates,
even when frameworks are distorted. Although superimposable duplicates
would be found by purely geometrical descriptors, even large differences
in any of them do not allow the conclusion that two structures are
different. Conversely, nonidentical structures can be more similar, with
respect to purely geometrical descriptors, than two different relaxations
of one structure. For example, if a symmetry is broken by relaxation
or if different clathrates induce distinct symmetries, multiples of
the original unit cell can be needed to describe the relaxed cleaned
structure, which makes it useless to compare the number of atoms or
cell parameters. In contrast, the properties that we obtain from the
bond network such as its atom types, topology, dimensionality, interpenetration,
and point and vertex symbols remain unchanged for all representations
of a structure. It immediately follows that in order to distinguish
two structures, it is sufficient that they differ in one of these
properties.

As an example, the 502 CoRe MOF database of 502
DFT relaxed MOFs
with assigned DDEC partial atomic charges was investigated, showing
that 15.5% (78) of the structures are redundant duplicates. A total
of 9.2% (46) of the structural files contain incorrect atomic coordinates
that affect the network topology and were replaced before the study,
and 2.8% (14) of the structures have wrong framework compositions. In all,
502 – 78 – 46 – 14 = 364 structures are reliable, which is
72.5% of the database. The analysis was performed using ToposPro.
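
The summary percentages can be reproduced from the counts given above. In this short check (variable names are illustrative), note that the figure of 364 reliable structures, i.e. 72.5% of 502, is obtained only when all three problem categories are subtracted.

```python
total = 502
duplicates = 78          # redundant duplicate entries
replaced = 46            # files with incorrect coordinates, replaced before the study
wrong_composition = 14   # entries with wrong framework composition

def percent(count):
    """Share of the 502 entries, rounded to one decimal place."""
    return round(100 * count / total, 1)

reliable = total - duplicates - replaced - wrong_composition

print(percent(duplicates), percent(replaced), percent(wrong_composition))  # 15.5 9.2 2.8
print(reliable, percent(reliable))                                         # 364 72.5
```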