Qingyang Dong, Jacqueline M. Cole.
Abstract
Large-scale databases of semiconductor band gaps curated from the scientific literature are valuable both for computational databases and for semiconductor materials research in general. This work presents an auto-generated database of 100,236 semiconductor band gap records, each with its associated temperature information, extracted from 128,776 journal articles. The database was produced using ChemDataExtractor version 2.0, a 'chemistry-aware' software toolkit that uses Natural Language Processing (NLP) and machine-learning methods to extract chemical data from scientific documents. The modified Snowball algorithm of ChemDataExtractor has been extended to incorporate nested models, optimized by hyperparameter analysis, and used together with the default NLP parsers to maximize the quality of the database. Evaluation of the database shows a weighted precision of 84% and a weighted recall of 65%. To the best of our knowledge, this is the largest open-source non-computational band gap database to date. Database records are available in CSV, JSON, and MongoDB formats, which are machine-readable and can assist data mining and semiconductor materials discovery.
Year: 2022 PMID: 35504897 PMCID: PMC9065101 DOI: 10.1038/s41597-022-01294-6
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 8.501
Descriptions and examples of Snowball objects.
| Object name | Description |
|---|---|
| Sentence | Sentence split from body text. |
| Relationship tuple | An entity list made of chemical name, property specifier, property value, and property unit. |
| Prefix | Word tokens before the first entity. |
| Middle | Word tokens between each relationship entity. |
| Suffix | Word tokens after the last entity. |
| Weight | Normalized importance factor for prefix, middles, and suffix. |
| Phrase object | A normalized vector of entities. |
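As a rough illustration of how the objects in the table relate to one another, the following Python sketch splits a tokenized sentence into prefix, middles, and suffix around four entity spans. The `Phrase` class, the `split_phrase` helper, and the hand-labelled token spans are assumptions made for this example; they are not ChemDataExtractor's API, and the weights and vector normalization are omitted.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Phrase:
    """Hypothetical container mirroring the Snowball objects in the table."""
    prefix: List[str]          # word tokens before the first entity
    middles: List[List[str]]   # word tokens between consecutive entities
    suffix: List[str]          # word tokens after the last entity
    entities: List[str]        # name, specifier, value, unit


def split_phrase(tokens: List[str], spans: List[Tuple[int, int]]) -> Phrase:
    """Split tokens around entity spans given as (start, end) with end exclusive."""
    spans = sorted(spans)
    prefix = tokens[:spans[0][0]]
    suffix = tokens[spans[-1][1]:]
    middles = [
        tokens[prev_end:next_start]
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:])
    ]
    entities = [" ".join(tokens[start:end]) for start, end in spans]
    return Phrase(prefix, middles, suffix, entities)


# Simplified whitespace tokenization of a sentence from the cluster example.
tokens = "However , TiO2 has a wide band gap of 3.2 eV which limits its application".split()
# Hand-labelled spans: compound name, property specifier, value, unit.
spans = [(2, 3), (6, 8), (9, 10), (10, 11)]
phrase = split_phrase(tokens, spans)
print(phrase)
```

The empty list that ends `phrase.middles` corresponds to a value and unit with no word tokens between them.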
An example of a sub-cluster containing four phrases, selected from a trained Snowball model.
| Components | Description |
|---|---|
| Phrase 1 | This insulating Al2O3 has a wide band gap Eg of 7–9 eV and acts purely as a mesoporous scaffold for the perovskite (CH3NH3PbI2Cl) to be deposited. |
| Phrase 2 | In addition, ZnO has a wide band gap of 3.37 eV, which inevitably restricts its practical application in visible light or sunlight. |
| Phrase 3 | However, TiO2 has a wide band gap of 3.2 eV which limits its application under visible light. |
| Phrase 4 | Pure TiO2 has a band gap of 3.2 eV and on loading CoOx, the band gap shifted to the visible region, as shown in Table |
| Centroid extraction pattern | (compound_names) has a wide (bandgap_specifier) of (bandgap_raw_value) <Blank> (bandgap_raw_units) |
| Confidence | 1.0 |
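To make the centroid extraction pattern concrete, the sketch below renders it as a literal regular expression and applies it to the four phrases. This is only an approximation for illustration: the regular expression, the `extract` helper, and the simplified compound-name and value sub-patterns are assumptions of this example, and they stand in for whatever matching the trained model actually performs.

```python
import re

# Literal reading of the centroid pattern:
# (compound_names) has a wide (bandgap_specifier) of (bandgap_raw_value) (bandgap_raw_units)
CENTROID = re.compile(
    r"(?P<compound_names>[A-Z][A-Za-z0-9]*)\s+has a wide\s+"
    r"(?P<bandgap_specifier>band gap)\s+of\s+"
    r"(?P<bandgap_raw_value>\d+(?:\.\d+)?)\s*"
    r"(?P<bandgap_raw_units>eV)"
)

PHRASES = [
    "This insulating Al2O3 has a wide band gap Eg of 7–9 eV and acts purely as a mesoporous scaffold",
    "In addition, ZnO has a wide band gap of 3.37 eV, which inevitably restricts its practical application",
    "However, TiO2 has a wide band gap of 3.2 eV which limits its application under visible light.",
    "Pure TiO2 has a band gap of 3.2 eV and on loading CoOx, the band gap shifted to the visible region",
]


def extract(sentence):
    """Return the named groups of the first literal match, or None."""
    match = CENTROID.search(sentence)
    return match.groupdict() if match else None


for sentence in PHRASES:
    print(extract(sentence))
```

Only Phrases 2 and 3 match this literal form. Phrase 1 (an extra "Eg" token and a value range) and Phrase 4 (no "wide") do not, even though all four belong to the same sub-cluster, so an exact-match rule of this kind is stricter than the grouping shown in the table.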
Fig. 1 The general workflow of the original ChemDataExtractor version 2.0 (top), and the workflow adopted in this work (bottom).

Description of the band gap records and their attributes.
| Key | Description | Data type |
|---|---|---|
| Name | Chemical compound names | String |
| Composition | Elements and their numbers of atoms per molecule | Dictionary |
| Value | Normalized band gap value | List of floats |
| Unit | Normalized band gap unit | String |
| Raw_value | Text string of band gap value | String |
| Raw_unit | Text string of band gap unit | String |
| Temperature_value | Normalized temperature value | List of floats |
| Temperature_unit | Normalized temperature unit | String |
| Temperature_raw_value | Text string of temperature value | String |
| Temperature_raw_unit | Text string of temperature unit | String |
| AutoSentenceParser | Source of the data record | Boolean |
| Snowball | Source of the data record | Boolean |
| BandgapDB | Source of the data record | Boolean |
| Confidence | Confidence score from Snowball model | Float |
| Text | Sentence from which data is extracted | String |
| Publisher | Name of the publisher of the paper | String |
| DOI | DOI of the paper | String |
| Notes | Additional reference for the data record | String |
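A minimal sketch of how records with these attributes might be consumed from the JSON release is given below. The two sample records are hypothetical: their keys follow the table, but their values (including the `"eV"` unit string and the confidence scores) are illustrative assumptions, several keys are omitted for brevity, and the actual file layout is not described here. The `midpoint` and `select_snowball` helpers are likewise names invented for this example.

```python
import json

# Hypothetical records: keys follow the attribute table, values are illustrative.
SAMPLE_JSON = """
[
  {"Name": "TiO2", "Composition": {"Ti": 1, "O": 2}, "Value": [3.2],
   "Unit": "eV", "Raw_value": "3.2", "Raw_unit": "eV",
   "AutoSentenceParser": false, "Snowball": true, "Confidence": 1.0},
  {"Name": "Al2O3", "Composition": {"Al": 2, "O": 3}, "Value": [7.0, 9.0],
   "Unit": "eV", "Raw_value": "7–9", "Raw_unit": "eV",
   "AutoSentenceParser": false, "Snowball": true, "Confidence": 0.6}
]
"""

records = json.loads(SAMPLE_JSON)


def midpoint(values):
    """Collapse a list of floats (a single value or a range) to one number."""
    return sum(values) / len(values)


def select_snowball(records, min_confidence):
    """Keep Snowball-sourced records at or above a confidence score."""
    return [
        (rec["Name"], midpoint(rec["Value"]))
        for rec in records
        if rec.get("Snowball") and rec.get("Confidence", 0.0) >= min_confidence
    ]


print(select_snowball(records, min_confidence=0.65))
```

Treating `Value` as a list lets a single number and a range such as 7–9 eV share one code path; taking the midpoint is one possible convention, not something the database prescribes.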
Evaluation results of AutoSentenceParser and the Snowball model.
| Parser | Threshold | Total extracted | Removed | TP | Precision | Recall | F-score |
|---|---|---|---|---|---|---|---|
| AutoSentenceParser | N/A | 212 | 53 | 115 | 72.3% | 55.0% | 62.5% |
| Snowball | 95% | 12 | 0 | 12 | 100.0% | 5.7% | 10.9% |
| Snowball | 90% | 25 | 0 | 25 | 100.0% | 12.0% | 21.4% |
| Snowball | 85% | 39 | 0 | 39 | 100.0% | 18.7% | 31.5% |
| Snowball | 80% | 67 | 5 | 59 | 95.2% | 28.2% | 43.5% |
| Snowball | 75% | 101 | 6 | 89 | 93.7% | 42.6% | 58.6% |
| Snowball | 70% | 123 | 12 | 102 | 91.9% | 48.8% | 63.8% |
| Snowball | 65% | 141 | 20 | 107 | 88.4% | 51.2% | 64.8% |
| Snowball | 60% | 157 | 27 | 106 | 81.5% | 50.7% | 62.5% |
| Snowball | 55% | 169 | 34 | 107 | 79.3% | 51.2% | 62.2% |
| Snowball | dynamic | 147 | 21 | 113 | 89.7% | 54.1% | 67.5% |
The fourth column, “Removed”, is the number of records deleted during post-processing.
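The tabulated scores can be reproduced from the raw counts. The sketch below assumes that precision is computed over the records retained after post-processing (total extracted minus removed), and that the evaluation set contains 209 relevant records; neither is stated explicitly here. The figure of 209 is inferred from the AutoSentenceParser row (115 true positives at 55.0% recall), and `evaluate` is a helper name introduced for this example.

```python
RELEVANT_TOTAL = 209  # assumption: inferred from 115 TP at 55.0% recall


def evaluate(total_extracted, removed, true_positives, relevant_total=RELEVANT_TOTAL):
    """Return (precision, recall, F-score) as fractions."""
    retained = total_extracted - removed        # records kept after post-processing
    precision = true_positives / retained
    recall = true_positives / relevant_total
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


rows = {
    "AutoSentenceParser": (212, 53, 115),
    "Snowball 80%": (67, 5, 59),
    "Snowball dynamic": (147, 21, 113),
}

for name, counts in rows.items():
    p, r, f = evaluate(*counts)
    print(f"{name}: precision {p:.1%}, recall {r:.1%}, F-score {f:.1%}")
```

Under these assumptions the three rows come out at 72.3% / 55.0% / 62.5%, 95.2% / 28.2% / 43.5%, and 89.7% / 54.1% / 67.5%, matching the table.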
Fig. 2 Performance of the Snowball parser and AutoSentenceParser against (a) and (b), evaluated at dotted points. In (a), is set to 65%; in (b), is set to 0%. For comparison, results of AutoSentenceParser are represented by dashed lines.
Fig. 3 Band gap energy distribution of all records in the database (a) and of titanium dioxide records (b). Dashed red lines indicate the lower and upper bounds of the visible spectrum at 1.65 eV and 3.10 eV.
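The visible-spectrum bounds in the caption can be related to wavelength through λ = hc/E, with hc ≈ 1239.84 eV·nm. The short sketch below converts a band gap energy to the corresponding photon wavelength and labels it relative to the 1.65 eV and 3.10 eV bounds; the function names and the three region labels are assumptions made for this example.

```python
HC_EV_NM = 1239.841984  # Planck constant times speed of light, in eV·nm

VISIBLE_LOWER_EV = 1.65  # lower bound from the figure caption
VISIBLE_UPPER_EV = 3.10  # upper bound from the figure caption


def gap_to_wavelength_nm(band_gap_ev):
    """Wavelength of a photon whose energy equals the band gap."""
    return HC_EV_NM / band_gap_ev


def spectral_region(band_gap_ev):
    """Label a band gap relative to the visible-spectrum bounds."""
    if band_gap_ev < VISIBLE_LOWER_EV:
        return "infrared"
    if band_gap_ev > VISIBLE_UPPER_EV:
        return "ultraviolet"
    return "visible"


for name, gap in [("TiO2", 3.2), ("ZnO", 3.37)]:
    print(name, round(gap_to_wavelength_nm(gap), 1), "nm", spectral_region(gap))
```

The bounds of 1.65 eV and 3.10 eV correspond to roughly 751 nm and 400 nm. A 3.2 eV gap lies just above the upper bound, which agrees with the extracted sentence stating that this gap limits the application of TiO2 under visible light.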
| Measurement(s) | semiconductor band gaps |
| Technology Type(s) | natural language processing |