Aditya Nandy1,2, Gianmarco Terrones1, Naveen Arunachalam1, Chenru Duan1,2, David W Kastner1,3, Heather J Kulik4.
Abstract
We report a workflow and the output of a natural language processing (NLP)-based procedure to mine the extant metal-organic framework (MOF) literature describing structurally characterized MOFs and their solvent removal and thermal stabilities. We obtain over 2,000 solvent removal stability measures from text mining and 3,000 thermal decomposition temperatures from thermogravimetric analysis data. We assess the validity of our NLP methods and the accuracy of our extracted data by comparing to a hand-labeled subset. Machine learning (ML; here, artificial neural network) models trained on these data using graph- and pore-geometry-based representations enable prediction of stability on new MOFs with quantified uncertainty. Our web interface, MOFSimplify, provides users access to our curated data and enables them to harness those data for predictions on new MOFs. MOFSimplify also encourages community feedback on existing data and on ML model predictions, supporting community-based active learning for improved MOF stability models.
Year: 2022 PMID: 35277533 PMCID: PMC8917177 DOI: 10.1038/s41597-022-01181-0
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Fig. 1 Workflows for curating datasets for solvent removal and thermal stability. First, we use sanitized MOFs from published works, filter by structures that can be featurized, obtain manuscripts corresponding to structures, download these manuscripts to prepare them for natural language processing, and finally text mine the manuscripts to identify mentions of solvent removal stability or thermogravimetric analysis data. We identify thermogravimetric analysis traces from manuscripts with thermogravimetric analysis keywords. The two sets of data gathered during this workflow are then used to train machine learning models.
Keywords used for regular expression searches for solvent removal and thermal stabilities.
| stemmed keyword | keyword category | words identified | stability type |
|---|---|---|---|
| collaps | collapse | collaps(e/ed/es/ing) | solvent removal |
| deform | collapse | deform(ed/s/ing/ation) | solvent removal |
| amorph | collapse | amorph(ous/ize) | solvent removal |
| blockage | collapse | blockage | solvent removal |
| degrad | collapse | degrad(e/ed/es/ing/ation) | solvent removal |
| unstable | collapse | unstable | solvent removal |
| instability | collapse | instability | solvent removal |
| destroy | collapse | destroy(ed/s/ing) | solvent removal |
| one step weight | collapse | one(-)step weight loss | solvent removal |
| single step weight | collapse | single(-)step weight loss | solvent removal |
| stable | stable | stable | solvent removal |
| stability | stable | stability | solvent removal |
| preserv | stable | preserv(e/ed/es/ing) | solvent removal |
| crystallinity | stable | crystallinity | solvent removal |
| coordinatively unsaturat | stable | coordinatively unsaturat(ed/ing) | solvent removal |
| porosity | stable | (micro)porosity | solvent removal |
| retain | stable | retain(ed/s/ing) | solvent removal |
| maintain | stable | maintain(ed/s/ing) | solvent removal |
| two step weight | stable | two(-)step weight loss | solvent removal |
| solvent | solvent | solvent(s) | solvent removal |
| guest | solvent | guest(s) | solvent removal |
| desolvat | solvent | desolvat(e/ed/es/ing) | solvent removal |
| remov | solvent | remov(e/ed/es/ing) | solvent removal |
| activat | solvent | activat(e/ed/es/ing) | solvent removal |
| evacuat | solvent | evacuat(e/ed/es/ing) | solvent removal |
| dehydrat | solvent | dehydrat(e/ed/es/ing) | solvent removal |
| eliminat | solvent | eliminat(e/ed/es/ing) | solvent removal |
| water, H2O | solvent | water, H2O | solvent removal |
| DMF, formamide | solvent | DMF, formamide | solvent removal |
| DMA, methylamine, diamine | solvent | DMA, methylamine, diamine | solvent removal |
| EtOH, MeOH, ethanol, methanol | solvent | EtOH, MeOH, ethanol, methanol | solvent removal |
| pyrrolidone | solvent | pyrrolidone | solvent removal |
| TG | thermal | TG(A) | thermal |
| thermogravimetric | thermal | thermogravimetric analysis | thermal |
| thermal gravimetric | thermal | thermal(-)gravimetric analysis | thermal |
Stemmed forms of each word were used to identify keywords that have different tenses or forms. We label each word with a category and the type of stability that it identified.
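The stemmed-keyword matching described above can be illustrated with a minimal sketch. It assumes a word-boundary regular expression per stem and uses only a subset of the stems from the table; the names `STEMS`, `compile_stem`, and `categories_in` are hypothetical and are not taken from the authors' code, and the actual pipeline may match and combine keywords differently.

```python
import re

# Subset of the stemmed keywords from the table above, grouped by keyword category.
STEMS = {
    "collapse": ["collaps", "deform", "amorph", "degrad", "unstable", "instability", "destroy"],
    "stable": ["stable", "stability", "preserv", "crystallinity", "retain", "maintain"],
    "solvent": ["solvent", "guest", "desolvat", "remov", "activat", "evacuat", "dehydrat"],
    "thermal": ["thermogravimetric"],
}


def compile_stem(stem):
    # \b anchors the stem at the start of a word (so "stable" does not match
    # inside "unstable"); \w* absorbs suffixes such as -ed, -ing, or -ation.
    return re.compile(r"\b" + re.escape(stem) + r"\w*", re.IGNORECASE)


PATTERNS = {cat: [compile_stem(s) for s in stems] for cat, stems in STEMS.items()}
# "TG(A)" is matched case-sensitively as a whole word (assumption for this sketch).
PATTERNS["thermal"].append(re.compile(r"\bTGA?\b"))


def categories_in(sentence):
    """Return the sorted keyword categories with at least one match in the sentence."""
    return sorted(
        cat for cat, pats in PATTERNS.items() if any(p.search(sentence) for p in pats)
    )


print(categories_in("The framework collapsed upon removal of guest molecules."))
print(categories_in("TGA shows the structure is stable up to 300 C."))
```

Anchoring each stem at a word boundary is one way to keep "stable" and "stability" from firing inside "unstable" and "instability", which the table assigns to the opposite category.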
Fig. 2 Validation of the solvent removal and thermal stability data sets. (a) Comparison of NLP-assigned labels to hand-assigned labels over a 100 MOF subset, with NLP-assigned stable labels in blue and NLP-assigned unstable labels in orange. Cases that were correctly assigned are shown with a green outer ring, those that were incorrect are shown with a red outer ring, and ambiguous cases are shown with a gray outer ring. (b) Assignment of Td from TGA traces (top right, TGA traces adapted from ref. [80]) shown for two MOFs (SANGUM and SANHOH), with Td values inset. (c) The distribution of Td over the full thermal stability dataset, with the MOFs that have the lowest (WEVQOD01) and highest (IFAREN) thermal decomposition temperatures shown inset.
Description of revised autocorrelation (RAC) features with start/scope, operation performed, count of features removed, and total feature count.
| start | scope | operation | features removed | feature count |
|---|---|---|---|---|
| mc | all | product | 1 (mc-I-0-all) | 19 |
| mc | all | difference | 8 (Dmc-I-0-all, Dmc-I-1-all, Dmc-I-2-all, Dmc-I-3-all, Dmc-S-0-all, Dmc-T-0-all, Dmc-Z-0-all, Dmc-χ-0-all) | 12 |
| lc | linker | product | 1 (lc-I-0-linker) | 19 |
| lc | linker | difference | 8 (Dlc-I-0-linker, Dlc-I-1-linker, Dlc-I-2-linker, Dlc-I-3-linker, Dlc-S-0-linker, Dlc-T-0-linker, Dlc-Z-0-linker, Dlc-χ-0-linker) | 12 |
| func | linker | product | 0 | 20 |
| func | linker | difference | 8 (Dfunc-I-0-linker, Dfunc-I-1-linker, Dfunc-I-2-linker, Dfunc-I-3-linker, Dfunc-S-0-linker, Dfunc-T-0-linker, Dfunc-Z-0-linker, Dfunc-χ-0-linker) | 12 |
| full | all | product | 0 | 20 |
| full | linker | product | 0 | 20 |
| total | | | 26 | 134 |
Five heuristic atom-wise quantities are used to perform all product and difference operations: nuclear charge (Z), electronegativity (χ), topology (T), identity (I), and covalent radius (S). MOF RACs have four possible starts and two possible scopes. The starts are metal-centered (mc), linker coordinating atom centered (lc), functional group centered (func), and every atom (full); the scopes are all atoms in the primitive cell (all) and all atoms in the linker (linker). All starts, scopes, and operations use bond depths of 0, 1, 2, and 3 to generate autocorrelations (for a total of 20 possible features for each scope). Cases that are invariant across all MOFs are listed in the “features removed” column. RAC features are given using the notation:
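Independently of the feature notation, the product and difference operations described above can be illustrated with a minimal sketch on a toy molecular graph. It assumes that each feature sums, over every start atom i and every scope atom j at bond depth d from i, either the product or the difference of an atom-wise property. The function names, the toy connectivity, and the choice to sum rather than average over start atoms are assumptions for illustration, not the authors' implementation.

```python
from collections import deque


def bond_distances(adjacency, start):
    """Breadth-first search: shortest bond-path length from `start` to each reachable atom."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for nbr in adjacency[atom]:
            if nbr not in dist:
                dist[nbr] = dist[atom] + 1
                queue.append(nbr)
    return dist


def autocorrelations(adjacency, prop, starts, scope, max_depth=3):
    """Product and difference autocorrelations for bond depths 0..max_depth.

    For each start atom i and each scope atom j at depth d from i, accumulate
    prop[i] * prop[j] (product) and prop[i] - prop[j] (difference).
    """
    product = [0.0] * (max_depth + 1)
    difference = [0.0] * (max_depth + 1)
    for i in starts:
        for j, d in bond_distances(adjacency, i).items():
            if j in scope and d <= max_depth:
                product[d] += prop[i] * prop[j]
                difference[d] += prop[i] - prop[j]
    return product, difference


# Hypothetical toy fragment: a Zn atom (0) bound to two O atoms (1, 2) that share a C atom (3).
adjacency = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
nuclear_charge = {0: 30, 1: 8, 2: 8, 3: 6}  # property Z

# Metal-centered (mc) start with an all-atom scope.
product, difference = autocorrelations(
    adjacency, nuclear_charge, starts=[0], scope=set(adjacency)
)
print(product)     # [900.0, 480.0, 180.0, 0.0]
print(difference)  # [0.0, 44.0, 24.0, 0.0]
```

Under this definition the depth-0 difference is zero for any property, and the difference of a constant property such as identity is zero at every depth, which is consistent with those features appearing in the “features removed” column.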
Description of geometric features generated by Zeo++ with definitions and units.
| variable name | explanation | units |
|---|---|---|
| Df | maximum free sphere | Å |
| Di | maximum included sphere | Å |
| Dif | maximum included sphere in the free sphere path | Å |
| GPOAV | gravimetric pore accessible volume | cm³/g |
| GPONAV | gravimetric pore non-accessible volume | cm³/g |
| GPOV | gravimetric pore volume | cm³/g |
| GSA | gravimetric surface area | m²/g |
| POAV | pore accessible volume | Å³ |
| PONAV | pore non-accessible volume | Å³ |
| POAVF | pore accessible volume fraction | unitless |
| PONAVF | pore non-accessible volume fraction | unitless |
| VPOV | volumetric pore volume | cm³/cm³ |
| VSA | volumetric surface area | m²/cm³ |
| ρ | crystal density | g/cm³ |
Fig. 3 Sections of the MOFSimplify web interface. (a) Interface for selecting a MOF for analysis and predicting properties of the selected MOF using ANNs trained on experimental data mined from the literature. The default MOF loaded upon selecting “Example MOF” is HKUST-1, a well-studied MOF [85]. (b) The feedback interface for evaluating model predictions. (c) The interface listing similar (i.e., LSNN) MOFs to the selected MOF as determined by the ANNs. (d) Visualization of the selected MOF’s components. (e) Visualization of the selected MOF’s unit cell.
| Measurement(s) | thermal decomposition |
| Technology Type(s) | thermogravimetry |