Olga Kononova, Tanjin He, Haoyan Huo, Amalie Trewartha, Elsa A. Olivetti, Gerbrand Ceder.
Abstract
Research publications are the major repository of scientific knowledge. However, their unstructured and highly heterogeneous format creates a significant obstacle to large-scale analysis of the information contained within. Recent progress in natural language processing (NLP) has provided a variety of tools for high-quality information extraction from unstructured text. These tools are primarily trained on non-technical text and struggle to produce accurate results when applied to scientific text that involves specialized technical terminology. In recent years, significant efforts in information retrieval have been made for biomedical and biochemical publications. For materials science, text mining (TM) methodology is still at the dawn of its development. In this review, we survey the recent progress in creating and applying TM and NLP approaches to the materials science field. This review is directed at the broad class of researchers aiming to learn the fundamentals of TM as applied to materials science publications.
Keywords: Computational Materials Science; Computing Methodology; Data Analysis; Materials Design
Year: 2021 PMID: 33665573 PMCID: PMC7905448 DOI: 10.1016/j.isci.2021.102155
Source DB: PubMed Journal: iScience ISSN: 2589-0042
Figure 1. Publication trend over the past 14 years
Top panel: Number of publications appearing every year in different fields of materials science. All data were obtained by manually querying the Web of Science publications resource. The analysis includes only research articles, communications, letters, and conference proceedings. The number of publications is on the order of 10³. Bottom panel: Relative comparison of the fraction of scientific papers available online as image PDF or embedded PDF versus articles in HTML/XML format. The gray arrow marks time intervals for both top and bottom panels.
Figure 2. Schematic representation of the standard text mining pipeline for information extraction from scientific publications
List of some common text repositories in chemistry and materials science that provide an API for querying
| Data repository | Document types | Access | Reference |
|---|---|---|---|
| CAplus | Research articles, patents, reports | Subscription | |
| DOAJ | Research articles (open-access only) | Public | |
| PubMed Central | Research articles | Public | |
| Science Direct (Elsevier) | Research articles | Subscription | |
| Scopus (Elsevier) | Abstracts | Public | |
| Springer Nature | Research articles, book chapters | Subscription | |
Note 1: Elsevier provides an API for both Science Direct (a collection of full texts published by Elsevier) and Scopus (a collection of abstracts from various publishers). Note 2: Springer Nature provides access only to its own published full texts.
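As an illustration of what querying such a repository can look like, the following is a minimal sketch for the public PubMed Central entry in the table. It assumes the NCBI E-utilities `esearch` endpoint with a JSON response; the search term, the helper names (`build_esearch_url`, `parse_id_list`, `search_pmc`), and the sample response are hypothetical and not taken from the review.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the NCBI E-utilities "esearch" endpoint is used to query PubMed Central.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, retmax=20):
    """Build a search URL for PubMed Central (db=pmc) that requests JSON output."""
    params = {"db": "pmc", "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH_URL + "?" + urllib.parse.urlencode(params)


def parse_id_list(response_text):
    """Extract the list of record identifiers from an esearch JSON response."""
    payload = json.loads(response_text)
    return payload.get("esearchresult", {}).get("idlist", [])


def search_pmc(term, retmax=20):
    """Run the query over the network and return the matching record identifiers."""
    with urllib.request.urlopen(build_esearch_url(term, retmax)) as response:
        return parse_id_list(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical search term, chosen only for illustration.
    print(build_esearch_url("solid-state synthesis", retmax=5))
```

Subscription-based repositories in the table generally require an API key and have their own endpoints and rate limits, so this sketch applies only to the public case.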
Examples of how different tokenizers split sentences into tokens
| Reagents | | |
| Reagents | | |
| Reagents | | |
| Reagents | | |
| Reagents | | |
| We | made | | |
| We | made | | |
| We | made | | |
| We | made | | |
| We | made | | |
NLTK (Bird et al., 2009) and SpaCy (Honnibal and Johnson, 2015) are general-purpose tokenizing tools, whereas ChemDataExtractor (Swain and Cole, 2016), OSCAR4 (Jessop et al., 2011), and ChemicalTagger (Hawizy et al., 2011) are tools trained on a scientific corpus. Tokens are bounded by the “” symbol.
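The difference between general-purpose and chemistry-aware tokenization can be sketched with two toy tokenizers. Neither function reproduces the behavior of NLTK, SpaCy, ChemDataExtractor, OSCAR4, or ChemicalTagger; both are simplified regular-expression heuristics, and the example sentence is hypothetical.

```python
import re


def punctuation_tokenize(text):
    """Toy general-purpose rule: every punctuation character becomes its own token."""
    return re.findall(r"\w+|[^\w\s]", text)


def chem_aware_tokenize(text):
    """Toy chemistry-aware rule: split on whitespace and detach only trailing
    sentence punctuation, so brackets inside a formula stay within one token."""
    tokens = []
    for chunk in text.split():
        match = re.fullmatch(r"(.*?)([.,;:!?]*)", chunk)
        core, trailing = match.group(1), match.group(2)
        if core:
            tokens.append(core)
        tokens.extend(trailing)  # each trailing punctuation mark is a separate token
    return tokens


if __name__ == "__main__":
    # Hypothetical sentence, used only for illustration.
    sentence = "Ni(NO3)2·6H2O was dissolved in water."
    print(punctuation_tokenize(sentence))
    print(chem_aware_tokenize(sentence))
```

The first function fragments the formula at every bracket, while the second keeps it as a single token, which is the kind of behavior that matters when formulas are passed on to later pipeline steps.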
Figure 3. Schematic representation of various information types that can be extracted from a typical materials science paper
Examples of chemical NER extraction
| – | |
| ‘Manganese’ ( | |
| ‘Aqueous’, ‘lithium’, ‘cobalt’, ‘manganese’, ‘nitrates’, ‘water’ | |
| ‘Lithium’, ‘cobalt’, ‘manganese nitrates’ | |
| ‘Lithium’, ‘cobalt’, ‘manganese nitrates’ | |
| ‘Lithium’, ‘cobalt’, ‘manganese nitrates’, ‘water’ | |
| ‘Lithium, cobalt, and manganese nitrates’, ‘water’ | |
| – | |
| – | |
| ‘Ce3+’, ‘Eu2+’, ‘Ca2Si5N8’ | |
| ‘Ce3+-Eu2+’, ‘Ca2Si5N8’ | |
| ‘Ce3+-Eu2+’, ‘Ca2Si5N8’ | |
| ‘Ce3+-Eu2’, ‘co’, ‘Ca2Si5N8’ | |
| ‘Ce3+-Eu2+ co-doped Ca2Si5N8’ | |
| ‘NO3’, ‘NO3’, ‘CH3COO’ ( | |
| ‘Bi2Cu1-xNixO4’ ( | |
| ‘Bi(NO3)3·5H2O’, ‘Ni(NO3)2·6H2O’, ‘Cu(CH3COO)2·H2O’ | |
| ‘Bi(NO3)3·5H2O’, ‘Ni(NO3)2·6H2O’, ‘Cu(CH3COO)2·H2O’, ‘Bi2Cu1-xNixO4’ | |
| ‘Bi(NO3)3·5H2O’, ‘Ni(NO3)2·6H2O’, ‘Cu(CH3COO)2·H2O’, ‘Bi2Cu1-xNixO4’ | |
| ‘Bi(NO3)3·5H2O’, ‘Ni(NO3)2·6H2O’, ‘Cu(CH3COO)2·H2O’, ‘Bi2Cu1-xNixO4’ | |
| ‘Bi(NO3)3·5H2O’, ‘Ni(NO3)2·6H2O’, ‘Cu(CH3COO)2·H2O’, ‘Bi2Cu1-xNixO4’ | |
Examples of the chemical named entities extracted by the general-purpose NER tools NLTK (Bird et al., 2009) and SpaCy (Honnibal and Johnson, 2015), and by tools trained on chemical corpora: OSCAR4 (Jessop et al., 2011), tmChem (Leaman et al., 2015), ChemSpot (Rocktäschel et al., 2012), ChemDataExtractor (Swain and Cole, 2016), and BiLSTM chemical NER (He et al., 2020). For the general-purpose tools, the assigned labels are given in parentheses. For the chemical NER tools, only entities labeled as chemical compounds are shown.
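Differences like those in the table are commonly quantified with exact-match precision and recall against a manually annotated reference. The sketch below shows that calculation; the choice of reference set is an assumption made for illustration (one of the table's outputs is treated as the annotation), and `precision_recall` is a hypothetical helper, not part of any tool named above.

```python
def precision_recall(predicted, reference):
    """Exact-match precision and recall over sets of entity strings."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


if __name__ == "__main__":
    # Assumption: this output is treated as the reference annotation.
    reference = ["Ce3+", "Eu2+", "Ca2Si5N8"]
    # One of the extracted outputs listed in the table.
    predicted = ["Ce3+-Eu2", "co", "Ca2Si5N8"]
    precision, recall = precision_recall(predicted, reference)
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

Exact matching is strict: a partially correct span such as 'Ce3+-Eu2' counts as both a false positive and a missed entity, which is one reason reported scores depend on the matching criterion.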
Figure 4. Accuracy of chemical NER extraction
(A) Precision and recall of published chemical NER models, manually extracted from the corresponding reports.
Color denotes the primary algorithm underlying the model.
(B) Accuracy of the data extracted from materials synthesis paragraphs plotted against the complexity of the paragraphs. The accuracy is computed by applying the chemical NER models developed by our team (Kononova et al., 2019; He et al., 2020) to manually annotated paragraphs. The text complexity is calculated as the Flesch-Kincaid grade level (FKGL) score, which indicates the education level required to understand the paragraph (Kincaid et al., 1975). ρ is the Pearson correlation coefficient between the accuracy of the NER model and the FKGL score.
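The two quantities in panel (B) can be sketched as follows. The FKGL formula is the standard one, 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59, but the sentence splitter and the syllable counter below are crude heuristics assumed for illustration; they are not the procedure used to produce the figure.

```python
import math
import re


def count_syllables(word):
    """Heuristic: count groups of consecutive vowels, with at least one per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fkgl(text):
    """Flesch-Kincaid grade level using naive sentence and word splitting.
    Splitting on '.', '!' and '?' miscounts decimals and chemical formulas."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)
```

Given one FKGL score and one accuracy value per annotated paragraph, `pearson(fkgl_scores, accuracies)` would give a ρ of the kind reported in the panel.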