
Zeo-1, a computational data set of zeolite structures.

Leonid Komissarov, Toon Verstraelen

Abstract

Fast, empirical potentials are gaining increasing popularity in the computational fields of materials science, physics and chemistry. With this comes a rising demand for high-quality reference data for the training and validation of such models. In contrast to research that is mainly focused on small organic molecules, this work presents a data set of geometry-optimized bulk phase zeolite structures. Covering a majority of framework types from the Database of Zeolite Structures, this set includes over thirty thousand geometries. Calculated properties include system energies, nuclear gradients and stress tensors at each point, making the data suitable for model development, validation or referencing applications focused on periodic silica systems.
© 2022. The Author(s).

Year:  2022        PMID: 35194039      PMCID: PMC8863849          DOI: 10.1038/s41597-022-01160-5

Source DB:  PubMed          Journal:  Sci Data        ISSN: 2052-4463            Impact factor:   6.444


Background & Summary

Atomistic models are an essential tool for the prediction of thermodynamic, mechanical or biochemical properties of a substance. More recently, the use of pre-trained models has become increasingly popular due to their comparatively low complexity and high accuracy on modern hardware[1-6]. In order for such models to perform well, their empirical parameters require fitting to high-quality reference data. Depending on the application, reference data are either experimental, or come from computationally more expensive ab initio calculations. Although there are already a handful of large computational data sets covering small organic molecules[7-9], such data are still scarce for larger periodic systems (cf. Materials Cloud Archive[10,11] or the NOMAD database[12,13]). Motivated by this fact, we present a quantum-chemical data set for zeolites. Zeolites are porous materials composed of interconnected SiO4 or AlO4 tetrahedra. Their properties can be fine-tuned through synthesis of materials with specific pore size, or the inclusion of additional metal cation sites[14-17]. Because of their topology and synthetic flexibility, zeolites have various applications as adsorbents[18-20] and catalysts[17,21-23]. To this day, a myriad of different zeolite framework types is available experimentally, and many more hypothetical structures can be derived[24-26]. The documentation of fundamental zeolite framework types and derived materials has led to the publication of the well-known Atlas of Zeolite Structures[27] in several editions. The atlas lists each unique framework type by its three-letter code, as assigned by the Structure Commission of the International Zeolite Association (IZA). Today, its contents are available online at the Database of Zeolite Structures[28], which we use as a source of initial structures for our data set.
In this first installment, we include properties for 204 out of the currently available 256 zeolite framework types in the database (a total of 226 unique geometries when also considering derived materials). Our descriptor provides the complete optimization trajectories for each system with atomic positions, lattice vectors, atomic gradients and stress tensors at each step. We envision future extensions of the data set to focus on derived geometries, covering structural defects and host-guest interactions.

Methods

Initial zeolite structures are collected from the public Database of Zeolite Structures[28] in the Crystallographic Information File (CIF) format, before conversion to the XYZ format with the Atomic Simulation Environment[29] (ASE) package. After selection of all systems with fewer than 301 atoms, each is manually filtered by removing redundant atom positions in case of fractional occupancies and adding missing hydrogen atoms where needed. Each structure’s coordinates and cell parameters are energy-minimized with the periodic density functional code BAND[30], as implemented in the Amsterdam Modeling Suite[31] (AMS). The calculations are performed with the revPBE functional[32,33], a ‘Small’ frozen core and the double-ζ polarized (DZP) basis set. Grimme’s D3(BJ) dispersion correction[34] is applied to all calculations. Previous research has shown that the selected level of theory can accurately reproduce zeolite geometries, albeit slightly overestimating the Si-O bond length (in the range of 2 pm) and yielding smaller Si-O-X angles (in the range of 5 degrees) when compared to experimental results[35,36]. At the same time, dispersion-corrected functionals are generally more accurate when describing adsorption processes[37-39]. For the optimization of the initial structures, geometry convergence criteria are left at their default values, namely 0.001 Hartree/Å, 0.00001 Hartree/Atom and 0.1 Å for atomic gradients, energy and atomic displacements, respectively. We use a quasi-Newton optimizer[40] in the delocalized coordinate space for the initial optimizations. Cases of problematic convergence are restarted with the FIRE[41] optimizer.
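The three convergence thresholds quoted above can be checked against any pair of consecutive trajectory frames. The following is a minimal sketch under stated assumptions, not the AMS implementation: the function name `is_converged` is hypothetical, the energy change is divided by the number of atoms, the displacement is taken as the largest Cartesian component, and gradients are assumed to arrive in hartree/bohr (the unit listed in Table 1) and are converted to hartree/Å before comparison.

```python
import numpy as np

BOHR_TO_ANGSTROM = 0.529177210903  # length of one bohr in ångström


def is_converged(energy_prev, energy_curr, gradients_ha_per_bohr,
                 xyz_prev, xyz_curr,
                 grad_tol=1e-3, energy_tol=1e-5, step_tol=0.1):
    """Simplified check of the three thresholds quoted in the Methods.

    grad_tol   : hartree/Å, largest absolute gradient component
    energy_tol : hartree per atom, energy change between two steps
    step_tol   : Å, largest absolute Cartesian displacement component
    """
    n_atoms = gradients_ha_per_bohr.shape[0]
    # hartree/bohr -> hartree/Å: divide by the number of Å per bohr.
    max_grad = np.abs(gradients_ha_per_bohr / BOHR_TO_ANGSTROM).max()
    d_energy = abs(energy_curr - energy_prev) / n_atoms
    max_step = np.abs(xyz_curr - xyz_prev).max()
    return bool(max_grad < grad_tol
                and d_energy < energy_tol
                and max_step < step_tol)


# Toy data for three atoms (illustrative values only).
toy_gradients = np.full((3, 3), 1e-4)
toy_xyz_prev = np.zeros((3, 3))
toy_xyz_curr = toy_xyz_prev + 0.01
converged = is_converged(-10.0, -10.00001, toy_gradients,
                         toy_xyz_prev, toy_xyz_curr)
```

The sketch ignores atoms that are wrapped across a periodic boundary between two steps, so a displacement computed this way is only meaningful for unwrapped coordinates.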

Data Records

The data is made available at the Materials Cloud Archive[42]. Each system’s trajectory is stored in an individual NumPy[43] .npz file. Table 1 describes the arrays held in each file, which together store the complete geometry optimization trajectory: atomic coordinates, system energies, nuclear gradients, lattice vectors and stress tensors for each geometry optimization step. Entries at the first position correspond to the input structure; the last position holds the data for the final, optimized structure. Hirshfeld partial charges[44] are provided for the final (optimized) geometries. Atomic coordinates and lattice vectors are stored in ångström; all other properties are stored in atomic units.
Table 1

Overview of the data structures stored in a .npz file.

Data                                             Unit           Key         Array Shape
Atomic Numbers                                   -              numbers     (R,)
Atomic Coordinates                               Å              xyz         (N, R, 3)
x-, y- and z-Components of the Lattice Vectors   Å              lattice     (N, 3, 3)
Energy                                           hartree        energy      (N,)
Nuclear Gradients                                hartree/bohr   gradients   (N, R, 3)
Stress Tensors                                   atomic units   stress      (N, 3, 3)
Hirshfeld Charges                                atomic units   charges     (R,)

Each array can be accessed through the respective key. The variables N and R denote the number of geometry optimization steps and the system size respectively. Partial charges are only computed for the last geometry.
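Reading one record with the keys from Table 1 could look as follows. This is a minimal sketch: the helper names `make_dummy_record` and `load_final_frame` are hypothetical, the record is a synthetic in-memory stand-in rather than a file from the archive, and the conversion to eV and eV/Å is an optional convenience that is not part of the data set. Forces are taken as the negative of the stored nuclear gradients.

```python
import io

import numpy as np

HARTREE_TO_EV = 27.211386245988
BOHR_TO_ANGSTROM = 0.529177210903


def make_dummy_record(n_steps=4):
    """In-memory stand-in with the keys and shapes of Table 1 (R = 3)."""
    rng = np.random.default_rng(0)
    buffer = io.BytesIO()
    np.savez(
        buffer,
        numbers=np.array([14, 8, 8]),                      # (R,)
        xyz=rng.normal(size=(n_steps, 3, 3)),              # (N, R, 3), Å
        lattice=np.tile(10.0 * np.eye(3), (n_steps, 1, 1)),  # (N, 3, 3), Å
        energy=-np.arange(n_steps, dtype=float),           # (N,), hartree
        gradients=rng.normal(scale=1e-3, size=(n_steps, 3, 3)),  # hartree/bohr
        stress=np.zeros((n_steps, 3, 3)),                  # atomic units
        charges=np.array([1.0, -0.5, -0.5]),               # (R,), last frame
    )
    buffer.seek(0)
    return buffer


def load_final_frame(file):
    """Return the optimized (last) geometry of one trajectory."""
    with np.load(file) as data:
        return {
            "numbers": data["numbers"],
            "positions_ang": data["xyz"][-1],
            "cell_ang": data["lattice"][-1],
            "energy_ev": float(data["energy"][-1]) * HARTREE_TO_EV,
            # force = -gradient; hartree/bohr -> eV/Å
            "forces_ev_per_ang": -data["gradients"][-1]
            * HARTREE_TO_EV / BOHR_TO_ANGSTROM,
            "charges": data["charges"],
        }


frame = load_final_frame(make_dummy_record())
```

For a downloaded record, the path of the .npz file would be passed to `load_final_frame` in place of the in-memory buffer. Index 0 instead of -1 selects the input structure.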


Technical Validation

The complete data set includes geometry optimizations of 226 systems, resulting in a total of 32550 geometries. System sizes range between 15 and 334 atoms (mean: 126). We illustrate the convergence of all reference calculations in Fig. 1, showing that all optimized systems are well within the defined convergence criteria. Elemental occurrences in the data set are listed in Table 2. Si-O and Si-Si distances as well as Si-O-Si angles are presented in Fig. 2 as the most prominent geometrical descriptors. As most of the initial structures from the IZA database are idealized geometries[45], a sharp peak for the Si-O bond distance can be observed at roughly 161 pm (Fig. 2a, blue histogram). Long tails in the distribution vanish and the mean is shifted towards approximately 164 pm when considering geometry-optimized structures (Fig. 2a, orange histogram). Considering the Si-O-Si angles, a slight shift towards smaller values is observed (mean of 149 vs. 142 degrees, Fig. 2c). Both effects have been previously reported by Fischer et al.[35,36] and are inherent to the selected level of theory. Distributions of the Si-Si distances in the second coordination sphere do not shift significantly when comparing initial and optimized geometries (Fig. 2b). Relative changes in the cell volumes are presented in Fig. 3 as the ratio of each system’s optimized-to-initial volume. Values below 1 translate to a shrinking unit cell as the optimization progresses. Overall, the geometrical descriptors are in good agreement with experimental data[46-51]. Additional averages for bond distances and angles are summarized in Tables 3 and 4, respectively. Distributions of energies, atomic gradients, cell volumes and stress tensors are depicted in Fig. 4. As expected from geometry optimization trajectories, all properties except the relative cell volumes have a distinct mean close to zero. Structures close to the initial input geometries contribute to the relatively high standard deviations.
Evaluation of the relative cell volumes shows a shifted distribution, with roughly 76% of all structures having a larger volume than their respective optimized geometry. A detailed overview of all calculated structures, sorted by their IZA three-letter-code, the system size and number of iterations is provided in Online Table 1.
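The optimized-to-initial volume ratio discussed above can be recomputed from the stored lattice vectors. A minimal sketch, with hypothetical function names and a toy cubic cell in place of real data: the cell volume is the absolute determinant of the 3 × 3 lattice matrix, which is the same whether the vectors are stored as rows or as columns.

```python
import numpy as np


def cell_volumes(lattice):
    """Cell volume in Å^3 for every step of an (N, 3, 3) lattice array."""
    return np.abs(np.linalg.det(lattice))


def optimized_to_initial_ratio(lattice):
    """Ratio V_last / V_first; values below 1 mean the cell shrank."""
    volumes = cell_volumes(lattice)
    return volumes[-1] / volumes[0]


# Toy trajectory: a 10 Å cube that contracts to a 9.9 Å cube.
toy_lattice = np.stack([10.0 * np.eye(3), 9.9 * np.eye(3)])
ratio = optimized_to_initial_ratio(toy_lattice)
```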
Fig. 1

Distribution of convergence criteria at the last optimization step for all calculated systems in the data set. Showing (a) the highest absolute component of all nuclear gradients, (b) change in system energy and (c) highest relative atomic displacement.

Table 2

Elemental occurrences in the complete data set.

Element   Occurrence
Si        226
O         226
H         21
Al        12
N         4
Ca        4
Ge        3
Li        2
Na        2
K         2
C         2
F         1
Be        1
Cs        1
Ba        1

Counting all structures containing at least one atom of the listed element. Each element’s isolated atomic energy is listed in hartree.

Fig. 2

Distributions of (a) Si-O bond lengths, (b) Si-Si distances in the second coordination sphere and (c) Si-O-Si angles as calculated from all geometries in the data set. Blue and orange bars denote data from initial and optimized geometries, respectively. Mean μ and standard deviation σ printed in the same color as the underlying data. N denotes the total sample size.

Fig. 3

Distribution of relative cell volumes per system as the quotient of optimized-to-initial cell volumes. Values below 1 describe a shrinking cell as the optimization progresses. Black line marks V/V0  =  1. Sample size is 226.

Table 3

Mean atomic bond lengths and their standard deviations (std. dev.) in ångström.

Bond    Mean    Std. Dev.   Number of points
Si-O    1.638   0.0085      30439
H-O     0.999   0.1174      266
Al-O    1.763   0.0133      234
Ge-O    1.795   0.0239      202
Na-O    2.473   0.0841      104
C-C     1.540   0.0049      100
C-H     1.100   0.0027      98
K-O     3.175   0.4809      61
Ca-O    2.469   0.0925      57
N-H     1.055   0.0913      50
Si-K    3.945   0.3196      41
Cs-O    3.429   0.2820      28
Li-O    1.970   0.0263      21
Be-O    1.669   0.0152      16
Al-K    3.625   0.1650      14
C-N     1.472   0.0037      10
Ba-O    2.903   0.1261      10

Averaged over all geometry-optimized structures.
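Bond lengths such as those in Table 3 have to be measured across periodic boundaries. The sketch below is one possible way to do this for a single frame and is not necessarily how the tabulated values were obtained. It assumes that each row of the 3 × 3 lattice array is one lattice vector, uses a hypothetical 2.0 Å cutoff to decide what counts as a Si-O bond, and wraps fractional difference vectors to the nearest image, which is adequate for short bonds in cells that are not strongly skewed. It considers only one image per atom pair and is meant for pairs of different elements.

```python
import numpy as np


def pair_distances(xyz, lattice, numbers, z_a=14, z_b=8, cutoff=2.0):
    """Distances (Å) below `cutoff` between elements z_a and z_b, one frame.

    xyz     : (R, 3) Cartesian coordinates in Å
    lattice : (3, 3) lattice vectors in Å, one vector per row (assumption)
    numbers : (R,) atomic numbers
    """
    pos_a = xyz[numbers == z_a]
    pos_b = xyz[numbers == z_b]
    # All pairwise difference vectors, shape (n_a, n_b, 3).
    delta = pos_b[None, :, :] - pos_a[:, None, :]
    # Cartesian -> fractional, wrap into [-0.5, 0.5], back to Cartesian.
    frac = delta @ np.linalg.inv(lattice)
    frac -= np.round(frac)
    delta = frac @ lattice
    dist = np.linalg.norm(delta, axis=-1)
    return dist[dist < cutoff]


# Toy frame: one Si with one O neighbour inside the cell and one across
# the boundary of a 10 Å cubic cell.
toy_numbers = np.array([14, 8, 8])
toy_cell = 10.0 * np.eye(3)
toy_xyz = np.array([[0.2, 0.0, 0.0],
                    [1.8, 0.0, 0.0],
                    [8.55, 0.0, 0.0]])
si_o = pair_distances(toy_xyz, toy_cell, toy_numbers)
```

Applying the function to the last frame of every trajectory and pooling the results would give a distribution comparable to the Si-O entry of the table, subject to the cutoff chosen.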

Table 4

Mean Si-O-R angle distributions and their standard deviations (std. dev.) in degrees.

Angle      Mean    Std. Dev.   Number of points
Si-O-Si    148.7   11.2        12548
Si-O-Al    140.6   8.9         170
Si-O-K     106.8   8.8         81
Si-O-Na    112.8   15.1        64
Si-O-Ge    143.2   12.0        52
Si-O-H     110.7   7.9         40
Si-O-Cs    101.6   6.9         36
Si-O-Ca    118.5   16.6        19
Si-O-Be    129.9   0.2         16
Si-O-Li    112.7   4.2         8
Si-O-Ba    112.6   14.1        5

Averaged over all geometry-optimized structures.

Fig. 4

Distributions of physical quantities in the data set. Showing (a) energy differences per atom, relative to the respective energy of the optimized system; (b) atomic gradient components; (c) unit cell volumes, relative to the optimized system’s volume; (d) stress tensor components. Data is printed on a logarithmic y-scale for a clear display of the distribution. Mean μ and standard deviation σ printed in the same units as the underlying data. N denotes the total sample size.
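The per-atom energy differences of panel (a) follow directly from the stored energy array. A minimal sketch with a hypothetical function name and toy numbers: every energy of a trajectory is referenced to the energy of its last (optimized) step and divided by the system size R.

```python
import numpy as np


def relative_energy_per_atom(energy, n_atoms):
    """Energy of each step minus the final energy, per atom (hartree)."""
    energy = np.asarray(energy, dtype=float)
    return (energy - energy[-1]) / n_atoms


# Toy trajectory of three steps for a three-atom system.
rel_energy = relative_energy_per_atom([-10.0, -10.3, -10.6], n_atoms=3)
```

By construction the last entry is zero, and the entries of a monotonically converging optimization are non-negative.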

Online Table 1

Summary of all calculated systems, sorted by their IZA code. Showing the chemical formula, system size R and number of iterations N.

IZA Code  Chemical Formula      R     N
ABW       Si8O16                24    102
ABW_0     Si4Al4O20Li4H8        40    30
ACO       Si16O32               48    4
AEI       Si48O96               144   16
AEL       Si40O80               120   16
AEN       Si48O96               144   137
AET       Si72O144              216   188
AFG       Si48O96               144   248
AFI       Si24O48               72    369
AFN       Si32O64               96    72
AFO       Si40O80               120   19
AFR       Si32O64               96    283
AFS       Si56O112              168   20
AFT       Si72O144              216   471
AFV       Si30O60               90    14
AFX       Si48O96               144   29
AFY       Si16O32               48    42
AHT       Si24O48               72    263
ANA       Si48O96               144   7
APC       Si32O64               96    245
APD       Si32O64               96    408
AST       Si40O80               120   178
ASV       Si20O40               60    22
ATN       Si16O32               48    168
ATO       Si36O72               108   100
ATS       Si24O48               72    299
ATT       Si12O24               36    63
ATV       Si24O48               72    401
AVL       Si42O84               126   54
AWO       Si48O96               144   21
AWW       Si24O48               72    11
BCT       Si8O16                24    5
BEA       Si64O128              192   61
BEC       Si32O64               96    395
BIK       Si12O24               36    106
BIK_0     Si4Al2O14Li2H4        26    54
BOF       Si24O48               72    18
BOG       Si96O192              288   21
BOZ       Si92O184              276   14
BPH       Si28O56               84    143
BRE       Si16O32               48    23
BSV       Si96O192              288   9
CAN       Si12O24               36    14
CAS       Si24O48               72    316
CAS_0     Cs4Si20Al4O48         76    325
CDO       Si36O72               108   987
CFI       Si32O64               96    339
CGF       Si36O72               108   30
CGS       Si32O64               96    307
CHA       Si36O72               108   20
CHI       Si28O60H8             96    25
CON       Si56O112              168   37
CSV       Si20O40               60    342
CZP       Si24O48               72    14
DAC       Si24O48               72    293
DFT       Si8O16                24    10
DOH       Si34O68               102   359
DON       Si64O128              192   30
EAB       Si36O72               108   55
EDI       Si5O10                15    5
EDI_0     Ba2Si6Al4O28H16       56    198
EMT       Si96O192              288   11
EPI       Si24O48               72    398
ERI       Si36O72               108   20
ESV       Si48O96               144   20
ETL_0     Si72O144              216   11
ETR       Si48O96               144   10
ETV       Si14O28               42    25
EWO       Si24O48               72    21
EWS       Si96O192              288   19
EZT       Si48O96               144   12
FAR       Si84O168              252   39
FER       Si36O72               108   25
FRA       Si60O120              180   264
GIS       Si16O32               48    7
GIU       Si96O192              288   37
GME       Si24O48               72    12
GON       Si32O64               96    16
GOO       Si32O64               96    1627
GOO_0     Ca2Si12Al4O42H20      80    195
HEU       Si36O72               108   46
IFO       Si32O64               96    179
IFR       Si32O64               96    15
IFW       Si64O128              192   138
IFY       Si48O96               144   13
IRN       Si92O184              276   417
IRR       Si52O104              156   14
IRR_0     Ge12Si40O104          156   29
IRY       Si76O154H4            234   241
ISV       Si64O128              192   40
ITE       Si64O128              192   27
ITG       Si56O112              168   29
ITH       Si56O112              168   65
ITN       Si54O109H2            165   518
ITT       Si46O92               138   43
ITW       Si24O48               72    259
IWR       Si56O112              168   47
JBW       Si6O12                18    12
JNT       Si32O64               96    33
JRY       Si24O48               72    232
JSN       Si16O32               48    16
JSR       Si96O192              288   11
JST       Si48O96               144   8
JSW       Si48O96               144   19
KFI       Si96O192              288   170
LAU       Si24O48               72    15
LAU_0     Ca4Si16Al8O64H32      124   306
LEV       Si54O108              162   16
LEV_1     Si54O108N8C60H104     334   14
LIO       Si36O72               108   660
LIT       Si24O52H8             84    24
LOS       Si24O48               72    393
LOV       Si18O36               54    468
LTA       Si24O48               72    41
LTJ       Si16O32               48    11
LTJ_0     Si8Al8O36N8H40        100   172
LTL       Si36O72               108   14
MAR       Si72O144              216   561
MAZ       Si36O72               108   42
MEI       Si34O68               102   307
MEL       Si96O192              288   47
MEP       Si46O92               138   379
MER       Si32O64               96    131
MFI       Si96O192              288   175
MFS       Si36O72               108   1351
MON       Si16O32               48    86
MOR       Si48O96               144   43
MRE       Si48O96               144   11
MRT       Si24O48               72    80
MSO       Si90O180              270   71
MTF       Si44O88               132   15
MTT_0     Si24F2O48N2H8         84    298
MTW       Si28O56               84    386
MVY       Si12O24               36    122
MWW       Si72O144              216   16
NAB       Si10O20               30    455
NAB_0     Si16Na8O56Be4H32      116   72
NAT       Si20O40               60    303
NAT_0     Si24Al16Na16O96H32    184   103
NAT_3     Ca4Si12Al8O52H24      100   251
NON       Si88O176              264   7
NPO       Si6O12                18    19
NPT       Si36O72               108   9
NSI       Si12O24               36    73
OBW       Si76O152              228   11
OFF       Si18O36               54    37
OKO       Si68O136              204   35
OSI       Si32O64               96    11
OSO       Si9O18                27    8
OWE       Si16O32               48    39
PAR       Si32O68H8             108   24
PCR       Si60O120              180   40
PCR_0     Si60O120              180   38
PCS       Si64O128              192   32
PHI       Si32O64               96    882
PON       Si24O48               72    33
POR       Si64O128              192   288
POS       Si64O128              192   21
PTT       Si24O48               72    355
PTY       Si10O20               30    20
PUN       Si36O72               108   30
PWO       Si20O40               60    307
PWW       Si40O80               120   30
RHO       Si48O96               144   7
RRO       Si18O36               54    35
RSN       Si36O72               108   17
RTE       Si24O48               72    22
RTH       Si32O64               96    205
RUT       Si36O72               108   49
RWY       Si48O96               144   19
SAF       Si64O128              192   16
SAO       Si56O112              168   121
SAS       Si32O64               96    338
SAT       Si72O144              216   23
SAV       Si48O96               144   730
SBN       Si10O20               30    14
SBS       Si96O192              288   51
SEW       Si66O132              198   32
SFE       Si14O28               42    178
SFF       Si32O64               96    256
SFG       Si74O148              222   212
SFH       Si64O128              192   26
SFN       Si32O64               96    313
SFO       Si32O64               96    19
SFS       Si56O112              168   26
SIV       Si64O128              192   49
SOD       Si12O24               36    5
SOF       Si40O80               120   372
SOR       Si48O96               144   18
SOS       Si24O48               72    323
SSF       Si54O108              162   145
SSY       Si28O56               84    335
STF       Si32O64               96    22
STF_0     Si16O32               48    48
STI       Si72O144              216   13
STT       Si64O128              192   21
STW       Si60O120              180   40
SVR       Si92O192H16           300   27
SVV       Si56O112              168   61
SWY       Si72O144              216   21
SZR       Si36O72               108   12
SZR_0     K4Si32Al4O72          112   32
TER       Si80O160              240   114
THO       Si10O20               30    225
TOL       Si72O144              216   217
TON       Si24O48               72    334
TON_0     Si24O48N1C4H11        88    389
UEI       Si48O96               144   40
UFI       Si64O128              192   30
UFI_0     K8Si56Al8O128         200   243
UOE       Si12O24               36    13
UOS       Si24O48               72    11
UOZ       Si40O80               120   19
USI       Si40O80               120   31
UTL       Si76O152              228   70
UTL_0     Ge16Si60O152          228   216
UWY       Si60O120              180   26
UWY_0     Ge36Si24O142H44       246   512
VET       Si17O34               51    384
VFI       Si36O72               108   288
VSV       Si36O72               108   27
WEI       Si20O40               60    17
WEN       Si20O41H2             63    357
YUG       Si16O32               48    22
YUG_0     Ca2Si12Al4O40H16      74    372
ZON       Si32O64               96    307

Usage Notes

No data points were filtered as outliers with regard to the distributions of chemical properties (see Fig. 4). Consecutive structures from the same optimization trajectory will be autocorrelated. The data repository provides an interactive plotting script, displaying the system energy, maximum absolute component of the nuclear gradients and the cell volume at every iteration step for each structure. This requires the Bokeh[52] (v. 2.3.1) package for Python to be installed. SHA-1 hash sums are provided for each file to guarantee data integrity, as well as an example input script for a calculation with BAND. Naming conventions: Derived materials are referred to by their IZA three-letter code, e.g. H-EU-12 is tabulated as ETL_0. Leading non-alphabetical characters have been removed, e.g. *-ITN is tabulated as ITN.
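A downloaded file can be checked against its published SHA-1 hash sum with the Python standard library. The sketch below is an illustration only: the function names are hypothetical, the format in which the repository lists its hash sums is not assumed, and the demonstration uses a temporary file containing the bytes "abc", whose SHA-1 digest is a standard test vector, rather than a real record.

```python
import hashlib
import os
import tempfile


def sha1_of_file(path, chunk_size=1 << 20):
    """Hex SHA-1 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """True if the file's digest matches the expected hex string."""
    return sha1_of_file(path) == expected_hex.strip().lower()


# Demonstration on a temporary stand-in file, not on a real record.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.npz")
    with open(demo_path, "wb") as handle:
        handle.write(b"abc")
    checksum_ok = verify(demo_path,
                         "A9993E364706816ABA3E25717850C26C9CD0D89D")
    checksum_bad = verify(demo_path, "0" * 40)
```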
Measurement(s): potential energy
Technology Type(s): Computational Chemistry
Factor Type(s): Crystal structure, composition and topology