Zheren Wang1,2, Olga Kononova1,2, Kevin Cruse1,2, Tanjin He1,2, Haoyan Huo1,2, Yuxing Fei1,2, Yan Zeng2, Yingzhi Sun1,2, Zijian Cai1,2, Wenhao Sun3, Gerbrand Ceder4,5.
Abstract
The development of a materials synthesis route is usually based on heuristics and experience. A possible new approach would be to apply data-driven methods to learn the patterns of synthesis from past experience and use them to predict the syntheses of novel materials. However, this route is impeded by the lack of a large-scale database of synthesis formulations. In this work, we applied advanced machine learning and natural language processing techniques to construct a dataset of 35,675 solution-based synthesis procedures extracted from the scientific literature. Each procedure contains essential synthesis information including the precursors and target materials, their quantities, and the synthesis actions and corresponding attributes. Every procedure is also augmented with the reaction formula. Through this work, we are making freely available the first large dataset of solution-based inorganic materials synthesis procedures.
Year: 2022 PMID: 35614129 PMCID: PMC9132903 DOI: 10.1038/s41597-022-01317-2
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 8.501
Fig. 1 Extraction pipeline and example. Top panel: Schematic representation of the standard text mining pipeline: (i) scrape papers in markup format from the major publishers; (ii) identify and classify synthesis sections; (iii) extract key information including materials, amounts, sequenced operations, and conditions; (iv) store synthesis procedures into the database for future data mining. Bottom panel: Example of a codified procedure extracted from a synthesis paragraph.
Format of each data record: description, key label, data type.
| Data description | Data Key Label | Data Type |
|---|---|---|
| DOI of the original paper | doi | |
| Snippet of the raw text | paragraph_string | |
| Chemical formula | reaction | Object |
| - left_side: | ||
| - right_side: | ||
| Chemical formula in string format | reaction_string | |
| Target material data | target | Object |
| - material_string: | ||
| - material_formula: | ||
| - composition: | ||
| - additives: | ||
| - elements_vars: {var: | ||
| - amounts_vars: {var: | ||
| - oxygen_deficiency: | ||
| - mp_id: | ||
| List of target formulas obtained after variables substitution | targets_string | list of |
| Precursor materials data | precursors | |
| List of solvent formulas | solvents_string | |
| Sequence of synthesis steps and corresponding conditions | operations | |
| - token: string, | ||
| - type: | ||
| - conditions: Object | ||
| –temperature: | ||
| –time: | ||
| –atmosphere: | ||
| –mixing_device: | ||
| –mixing_media: | ||
| Materials with corresponding quantities | quantities | |
| - material: | ||
| - quantity: | ||
| Synthesis type | type | |
1 {formula: string, elements: {elements: amount of element}, amount: string}.
2 {max_value: float, min_value: float, values: list of floats}.
3 {max_value: float, min_value: float, values: list of floats, units: string}.
4 {number: float, unit: string}.
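As a rough illustration of how a record in this format might be read, the sketch below parses a hand-written JSON record and lists its steps. The key names follow the table above. Everything else is an assumption made for illustration: the placeholder DOI, the values, the operation and synthesis type labels, and the choice to store each condition as a list of objects shaped like footnote 3. The helper `summarize` is hypothetical and not part of the published dataset tooling.

```python
import json

# Hand-written, illustrative record. Keys follow the record-format table;
# all values and the exact nesting of "conditions" are assumptions.
RECORD_JSON = """
{
  "doi": "10.0000/placeholder",
  "paragraph_string": "placeholder synthesis paragraph",
  "target": {"material_string": "ZnO", "material_formula": "ZnO"},
  "operations": [
    {"token": "stirred", "type": "mixing",
     "conditions": {"temperature": [], "time": []}},
    {"token": "heated", "type": "heating",
     "conditions": {
       "temperature": [{"max_value": 120.0, "min_value": 120.0,
                        "values": [120.0], "units": "C"}],
       "time": [{"max_value": 12.0, "min_value": 12.0,
                 "values": [12.0], "units": "h"}]}}
  ],
  "type": "hydrothermal"
}
"""


def summarize(record):
    """Collect the target formula and, per step, any temperature values."""
    steps = []
    for op in record.get("operations", []):
        conditions = op.get("conditions") or {}
        temps = [(t["values"], t["units"])
                 for t in conditions.get("temperature") or []]
        steps.append((op["type"], op["token"], temps))
    return {
        "doi": record["doi"],
        "target": record["target"]["material_formula"],
        "steps": steps,
    }


record = json.loads(RECORD_JSON)
summary = summarize(record)
print(summary)
```

Reading conditions through `.get(...) or []` keeps the sketch tolerant of records where a condition is missing or empty, which seems likely given that recall for extracted conditions is below 1.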
Performance of data extraction for dataset entries.
| Data attribute | Precision | Recall | F1 score |
|---|---|---|---|
| Balanced reactions | 0.94 | / | / |
| - targets | 0.97 | / | / |
| - precursors | 0.98 | 0.99 | 0.98 |
| Operations | 0.96 | 0.85 | 0.90 |
| Conditions | |||
| - temperature | 0.97 | 0.92 | 0.94 |
| - time | 0.98 | 0.89 | 0.93 |
| - atmosphere | 0.97 | 0.92 | 0.94 |
| Quantities | 0.90 | 0.85 | 0.87 |
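The F1 column can be cross-checked against the precision and recall columns. The sketch below assumes the standard definition of F1 as the harmonic mean of precision and recall (the page itself does not state the formula) and recomputes it for every row that reports both values. The function name `f1_score` is chosen here for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the performance table.
reported = {
    "precursors": (0.98, 0.99, 0.98),
    "operations": (0.96, 0.85, 0.90),
    "temperature": (0.97, 0.92, 0.94),
    "time": (0.98, 0.89, 0.93),
    "atmosphere": (0.97, 0.92, 0.94),
    "quantities": (0.90, 0.85, 0.87),
}

for name, (p, r, f1) in reported.items():
    # Recomputed value rounded to two decimals, next to the reported one.
    print(name, round(f1_score(p, r), 2), f1)
```

Rows that report only precision (balanced reactions and targets) are left out, since F1 cannot be computed without a recall value.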
Ten most common targets in the dataset with their corresponding precursors.
| Targets | Common Precursors |
|---|---|
| ZnO | Zn(NO3)2, Zn(Ac)2, ZnCl2 |
| TiO2 | Ti(OCH(CH3)2)4, Ti(OC4H9)4, TiCl4 |
| Fe3O4 | FeCl3, FeCl2 |
| Fe2O3 | FeCl3, Fe(NO3)3 |
| SnO2 | SnCl4 |
| ZrO2 | ZrOCl2, ZrO(NO3)2 |
| CuO | Cu(NO3)2, Cu(Ac)2, CuCl2, CuSO4 |
| SiO2 | Si(OC2H5)4 |
| WO3 | Na2WO4, WCl6, H2WO4 |
| CdS | Na2S, CH4N2S, CdCl2, Cd(NO3)2 |
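A table like this could be tabulated from the records by counting targets and, for each target, the precursors that appear alongside it. The following is a minimal sketch, assuming each precursor entry carries a `material_formula` key like the target object does (the precursor data type is not shown above). The function `common_precursors` and the toy records are hypothetical and do not reproduce the dataset's actual counts.

```python
from collections import Counter, defaultdict


def common_precursors(records, top_targets=10, top_precursors=3):
    """Rank targets by frequency and list each one's most frequent precursors."""
    target_counts = Counter()
    precursor_counts = defaultdict(Counter)
    for record in records:
        target = record["target"]["material_formula"]
        target_counts[target] += 1
        for precursor in record.get("precursors", []):
            precursor_counts[target][precursor["material_formula"]] += 1
    return [
        (target,
         [p for p, _ in precursor_counts[target].most_common(top_precursors)])
        for target, _ in target_counts.most_common(top_targets)
    ]


# Toy records for illustration only; not taken from the dataset.
toy_records = [
    {"target": {"material_formula": "ZnO"},
     "precursors": [{"material_formula": "Zn(NO3)2"}]},
    {"target": {"material_formula": "ZnO"},
     "precursors": [{"material_formula": "Zn(NO3)2"}]},
    {"target": {"material_formula": "ZnO"},
     "precursors": [{"material_formula": "ZnCl2"}]},
    {"target": {"material_formula": "SnO2"},
     "precursors": [{"material_formula": "SnCl4"}]},
]

table = common_precursors(toy_records)
print(table)
```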
Fig. 2 The chemical space covered by the dataset. For each element, the box containing the element name is colored in a yellow-to-navy blue gradient representing the total number of reactions that produce a target compound containing the element. The bar graph below each element shows the list of ions paired with the element in precursor compounds. The fractions of the precursors (i.e. element + ion) used are shown by the length of the bars. Boxes with no bar graph represent elements occurring in five or fewer targets. “Ac” stands for acetate radical CH3COO− in the compound formula.
Fig. 3 Correspondence between choice of synthesis route and selected types of targets. The top table gives an example of the four synthesis categories defined: with heat treatment step, aqueous, non-aqueous, and mixed. The two pie charts on the top-right show the fractions of synthesis routes in the hydrothermal and precipitation datasets separately. The four rows of pie charts in the lower half of the figure represent the fractions of the four synthesis routes (given in the table) for all oxides, all sulfides, and individual oxides and sulfides with different oxidation states of data-rich transition metals separately. The first and second rows are results from the hydrothermal dataset. The third and fourth rows are results from the precipitation dataset. Each blank space means that there is not enough data to form a statistic for the corresponding type of target.
| Measurement(s) | solution-based inorganic synthesis data |
| Technology Type(s) | natural language processing |