Rongen Yan, Xue Jiang, Weiren Wang, Depeng Dang, Yanjing Su.
Abstract
Information Extraction (IE) in Natural Language Processing (NLP) aims to extract structured information from unstructured text to assist a computer in understanding natural language. Machine learning-based IE methods bring more intelligence and possibilities but require an extensive and accurately labeled corpus. In the materials science domain, producing reliable labels is a laborious task that requires the efforts of many professionals. To reduce manual intervention and automatically generate a materials corpus during IE, we propose a semi-supervised IE framework for materials based on an automatically generated corpus. Taking the superalloy data extraction from our previous work as an example, the proposed framework uses Snorkel to automatically label a corpus containing property values. An Ordered Neurons-Long Short-Term Memory (ON-LSTM) network is then trained on the generated corpus to obtain an information extraction model. The experimental results show that the F1-scores for the γ' solvus temperature, density and solidus temperature of superalloys are 83.90%, 94.02% and 89.27%, respectively. Furthermore, similar experiments on other materials indicate that the proposed framework is universal in the field of materials.
Year: 2022 PMID: 35831367 PMCID: PMC9279422 DOI: 10.1038/s41597-022-01492-2
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 8.501
Fig. 1 Process for information extraction. B-A represents the name of the superalloy, and B-Val represents the property value. LF_1, LF_2, …, LF_n represent the names of the labeling functions.
Examples of labeling functions.
| Labeling function | Description |
|---|---|
| LF_temperature_words | If the sentence contains words related to temperature, such as solvus, |
| LF_for | If the sentence contains “ ‘ |
| LF_oneMatch | If there is only one superalloy and one |
| LF_temperature_left | If the |
| LF_alloy_twoExpress | If a superalloy has two expressions, we label True. |
| LF_in | If there is a sentence of “ ‘ |
| LF_equal | If there is a keyword “equal” near the superalloy and |
| LF_hasTem | If there is a sentence pattern “has a temperature of”, we label True. |
| LF_between_and | If the |
| LF_be | If it is clear what the |
‘Description’ explains what the labeling function does.
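As a rough illustration of what a labeling function looks like, the sketch below implements only the `LF_hasTem` rule from the table, since its description is complete. It is plain Python rather than the authors' code: the names `lf_has_tem` and `apply_lfs`, the `ABSTAIN = -1` value (borrowed from Snorkel's convention) and the three sentences are all assumptions made for the example. In Snorkel itself such a function would normally be wrapped with the `labeling_function` decorator from `snorkel.labeling`.

```python
# Label values: 1 means "True"; -1 means "abstain" (Snorkel's convention).
ABSTAIN = -1
TRUE = 1


def lf_has_tem(sentence: str) -> int:
    """Vote True when the sentence uses the pattern 'has a temperature of'."""
    if "has a temperature of" in sentence.lower():
        return TRUE
    return ABSTAIN


def apply_lfs(sentences, lfs):
    """Build a label matrix: one row per sentence, one column per function."""
    return [[lf(sentence) for lf in lfs] for sentence in sentences]


# Made-up sentences, used only to exercise the function.
sentences = [
    "Alloy A has a temperature of 1200 K.",
    "The solvus of alloy B was measured.",
    "Samples were polished before imaging.",
]
label_matrix = apply_lfs(sentences, [lf_has_tem])
```

Each additional rule from the table would add one more column to `label_matrix`; a label model then combines the columns into a single training label per candidate.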
Coverage, overlaps and conflicts of different labeling functions.
| Labeling function | Coverage | Overlaps | Conflicts |
|---|---|---|---|
| LF_temperature_words | 0.227451 | 0.078431 | 0.0 |
| LF_for | 0.027451 | 0.027451 | 0.0 |
| LF_oneMatch | 0.074510 | 0.068627 | 0.0 |
| LF_temperature_left | 0.066667 | 0.064706 | 0.0 |
| LF_alloy_twoExpress | 0.007843 | 0.007843 | 0.0 |
| LF_in | 0.460784 | 0.176471 | 0.0 |
| LF_equal | 0.003922 | 0.003922 | 0.0 |
| LF_hasTem | 0.009804 | 0.009804 | 0.0 |
| LF_between_and | 0.007843 | 0.007843 | 0.0 |
| LF_be | 0.033333 | 0.033333 | 0.0 |
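The three statistics in this table can be computed directly from a label matrix. The sketch below assumes the usual Snorkel-style definitions: coverage is the fraction of candidates a function labels, overlaps is the fraction it labels together with at least one other function, and conflicts is the fraction where another function labels the same candidate differently. The function name `lf_summary` and the small matrix `L` are hypothetical.

```python
ABSTAIN = -1


def lf_summary(label_matrix, j):
    """Return (coverage, overlaps, conflicts) for the labeling function in column j."""
    n = len(label_matrix)
    covered = overlapped = conflicted = 0
    for row in label_matrix:
        if row[j] == ABSTAIN:
            continue  # this function did not vote on the candidate
        covered += 1
        # Votes cast by the other functions on the same candidate.
        others = [v for k, v in enumerate(row) if k != j and v != ABSTAIN]
        if others:
            overlapped += 1
        if any(v != row[j] for v in others):
            conflicted += 1
    return covered / n, overlapped / n, conflicted / n


# Made-up label matrix: 4 candidates (rows) x 3 labeling functions (columns).
L = [
    [1, 1, -1],
    [1, -1, -1],
    [-1, -1, -1],
    [1, 0, -1],
]
coverage, overlaps, conflicts = lf_summary(L, 0)
```

Under these definitions overlaps can never exceed coverage, and a function set that only ever votes True or abstains produces no conflicts at all.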
Fig. 2 F1-score and ROC-AUC on the generated dataset. A value greater than 0.8 indicates that the model is working well.
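For reference, both metrics can be computed from scratch for binary labels. This is a minimal sketch under standard definitions, not the evaluation code used in the paper; the labels, scores and the 0.5 decision threshold are invented for the example.

```python
def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    # Ties between a positive and a negative count as half a win.
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Made-up gold labels and model scores.
y_true = [1, 1, 0, 0, 1]
scores = [0.9, 0.4, 0.35, 0.8, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

f1 = f1_score(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

F1 depends on the chosen threshold, whereas ROC-AUC only depends on how the scores rank the candidates, which is why the two are usually reported together.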
An example of a manually corrected dataset.
| correct | name_id | attri_id | sentence | tokens |
|---|---|---|---|---|
| 1 | (2:2) | (16:17) | In particular Co-30Ni-12Al-4Ta-12Cr (12Cr)… | [In,particular,Co-30Ni-12Al-4Ta-12Cr,(, 12Cr,),…] |
| 1 | (4:4) | (16:17) | In particular Co-30Ni-12Al-4Ta-12Cr (12Cr)… | [In,particular,Co-30Ni-12Al-4Ta-12Cr,(, 12Cr,),…] |
| 1 | (12:12) | (15:16) | Moreover, it is worth noting that the solvus temperature value… | [Moreover,it,is,worth,noting,that,the,solvus,temperature,value,…] |
| 0 | (12:12) | (26:27) | Moreover, it is worth noting that the solvus temperature value… | [Moreover,it,is,worth,noting,that,the,solvus,temperature,value,…] |
| 0 | (12:12) | (36:36) | Moreover, it is worth noting that the solvus temperature value… | [Moreover,it,is,worth,noting,that,the,solvus,temperature,value,…] |
‘name_id’ and ‘attri_id’ represent the position of the name and attribute value in the sentence, respectively.
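The position fields can be turned back into text with a few lines of code. The sketch below assumes that positions are 0-based token indices and that the end index is inclusive, which is what the example rows suggest (a single-token name is written as `(2:2)`). The helper names `parse_span` and `span_text` and the token list are hypothetical.

```python
import re


def parse_span(span: str):
    """Parse a span such as '(16:17)' into a (start, end) pair of integers."""
    match = re.fullmatch(r"\((\d+):(\d+)\)", span.strip())
    if match is None:
        raise ValueError(f"not a span: {span!r}")
    return int(match.group(1)), int(match.group(2))


def span_text(tokens, span):
    """Join the tokens covered by the span (assumed 0-based, end inclusive)."""
    start, end = parse_span(span)
    return " ".join(tokens[start:end + 1])


# Made-up tokenized sentence, not taken from the dataset.
tokens = ["The", "density", "of", "AlloyX", "is", "8.2", "g/cm3", "."]
name = span_text(tokens, "(3:3)")
value = span_text(tokens, "(5:6)")
```

A two-token value span such as `(5:6)` here covers the number and its unit, which is one plausible reading of the two-token `attri_id` spans in the table.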
Fig. 3 The internal structure of ON-LSTM, where σ is the sigmoid activation function, f is the forget gate, i is the input gate and o is the output gate.
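The caption does not spell out how ON-LSTM differs from a standard LSTM. As background, and following the original ON-LSTM formulation rather than anything stated in this record, the key ingredient is the `cumax` activation (a cumulative sum of a softmax), which produces a "master" forget gate and a "master" input gate that are blended with the ordinary gates. The sketch below works on plain Python lists; the function names and the example logits are assumptions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cumax(xs):
    """Cumulative sum of softmax: a non-decreasing vector that ends at 1."""
    out, running = [], 0.0
    for p in softmax(xs):
        running += p
        out.append(running)
    return out


def ordered_gates(f, i, f_logits, i_logits):
    """Blend standard gates f and i with the master gates built from cumax."""
    master_f = cumax(f_logits)                     # non-decreasing, ends at 1
    master_i = [1.0 - v for v in cumax(i_logits)]  # non-increasing, ends at 0
    overlap = [a * b for a, b in zip(master_f, master_i)]
    # Inside the overlap the ordinary gates decide; outside it the master gates do.
    f_hat = [fg * w + (mf - w) for fg, w, mf in zip(f, overlap, master_f)]
    i_hat = [ig * w + (mi - w) for ig, w, mi in zip(i, overlap, master_i)]
    return f_hat, i_hat


# Made-up gate activations and logits for a 4-unit cell.
f_hat, i_hat = ordered_gates(
    f=[0.9, 0.8, 0.7, 0.6],
    i=[0.1, 0.2, 0.3, 0.4],
    f_logits=[0.0, 0.0, 0.0, 0.0],
    i_logits=[2.0, 1.0, 0.0, -1.0],
)
```

The resulting `f_hat` and `i_hat` replace the forget and input gates in the usual cell-state update, so that units at one end of the vector retain information longer than units at the other end.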
Fig. 4 Comparison of ON-LSTM with the algorithms proposed in previous articles. ON-LSTM is our proposed method.
F1-score for the density and solidus temperature of superalloys and the hardness of high entropy alloys.
| Physical property | Number of sentences | Number of candidates | F1-score (%) |
|---|---|---|---|
| density of superalloys | 222 | 846 | 94.02 |
| solidus temperatures of superalloys | 128 | 348 | 89.27 |
| hardness of high entropy alloys | 155 | 472 | 85.81 |
Fig. 5 The internal structure of LSTM. An LSTM cell consists of a memory cell c and three gates.
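For readers without access to the figure, the standard LSTM update can be written as follows. This is the textbook formulation, not a transcription of the figure; the weight matrices $W$, $U$, the biases $b$, the input $x_t$, the hidden state $h_t$ and the candidate state $\hat{c}_t$ are notation assumed here, while $f$, $i$, $o$ and $c$ match the gate and memory-cell names used in the captions.

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\hat{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \circ c_{t-1} + i_t \circ \hat{c}_t \\
h_t &= o_t \circ \tanh(c_t)
\end{aligned}
$$

Here $\circ$ denotes element-wise multiplication: the forget gate scales the previous memory, the input gate scales the new candidate, and the output gate controls how much of the memory is exposed as the hidden state.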