ChemDataExtractor: A Toolkit for Automated Extraction of Chemical Information from the Scientific Literature.

Matthew C Swain, Jacqueline M Cole.

Abstract

The emergence of "big data" initiatives has led to the need for tools that can automatically extract valuable chemical information from large volumes of unstructured data, such as the scientific literature. Since chemical information can be present in figures, tables, and textual paragraphs, successful information extraction often depends on the ability to interpret all of these domains simultaneously. We present a complete toolkit for the automated extraction of chemical entities and their associated properties, measurements, and relationships from scientific documents that can be used to populate structured chemical databases. Our system provides an extensible, chemistry-aware, natural language processing pipeline for tokenization, part-of-speech tagging, named entity recognition, and phrase parsing. Within this scope, we report improved performance for chemical named entity recognition through the use of unsupervised word clustering based on a massive corpus of chemistry articles. For phrase parsing and information extraction, we present the novel use of multiple rule-based grammars that are tailored for interpreting specific document domains such as textual paragraphs, captions, and tables. We also describe document-level processing to resolve data interdependencies and show that this is particularly necessary for the autogeneration of chemical databases, since captions and tables commonly contain chemical identifiers and references that are defined elsewhere in the text. We evaluated how accurately the toolkit extracts various types of data, obtaining F-scores of 93.4%, 86.8%, and 91.5% for chemical identifiers, spectroscopic attributes, and chemical property attributes, respectively; on the CHEMDNER chemical name extraction challenge, ChemDataExtractor yields a competitive F-score of 87.8%. All tools have been released under the MIT license and are available to download from http://www.chemdataextractor.org.
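
The F-scores quoted in the abstract summarize extraction quality as a single number. As a rough illustration only, the sketch below computes the balanced F-score (F1, the harmonic mean of precision and recall) for a toy set of extracted chemical names. It assumes the reported metric is the balanced F1, which the abstract does not state explicitly. The function name `f_score` and the example compound sets are hypothetical and are not taken from the paper or from the ChemDataExtractor API.

```python
def f_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold-standard and extracted names, chosen only for illustration.
gold = {"benzene", "toluene", "acetone", "ethanol"}
extracted = {"benzene", "toluene", "acetone", "water"}

tp = len(gold & extracted)   # names found and correct
fp = len(extracted - gold)   # names found but not in the gold standard
fn = len(gold - extracted)   # gold-standard names that were missed

score = f_score(tp, fp, fn)
print(f"F-score: {score:.3f}")
```

With three of four names correct on each side, precision and recall are both 0.75, so the balanced F-score is also 0.75.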

Year:  2016        PMID: 27669338     DOI: 10.1021/acs.jcim.6b00207

Source DB:  PubMed          Journal:  J Chem Inf Model        ISSN: 1549-9596            Impact factor:   4.956


  34 in total

1.  Putting hands to rest: efficient deep CNN-RNN architecture for chemical named entity recognition with no hand-crafted rules.

Authors:  Ilia Korvigo; Maxim Holmatov; Anatolii Zaikovskii; Mikhail Skoblov
Journal:  J Cheminform       Date:  2018-05-23       Impact factor: 5.514

2.  Polymer Informatics: Opportunities and Challenges.

Authors:  Debra J Audus; Juan J de Pablo
Journal:  ACS Macro Lett       Date:  2017-09-15       Impact factor: 6.903

3.  Data Sets Representative of the Structures and Experimental Properties of FDA-Approved Drugs.

Authors:  Dominique Douguet
Journal:  ACS Med Chem Lett       Date:  2018-01-29       Impact factor: 4.345

4.  Artificial Intelligence Applied to Battery Research: Hype or Reality? (Review)

Authors:  Teo Lombardo; Marc Duquesnoy; Hassna El-Bouysidy; Fabian Årén; Alfonso Gallo-Bueno; Peter Bjørn Jørgensen; Arghya Bhowmik; Arnaud Demortière; Elixabete Ayerbe; Francisco Alcaide; Marine Reynaud; Javier Carrasco; Alexis Grimaud; Chao Zhang; Tejs Vegge; Patrik Johansson; Alejandro A Franco
Journal:  Chem Rev       Date:  2021-09-16       Impact factor: 72.087

5.  Perovskite- and Dye-Sensitized Solar-Cell Device Databases Auto-generated Using ChemDataExtractor.

Authors:  Edward J Beard; Jacqueline M Cole
Journal:  Sci Data       Date:  2022-06-17       Impact factor: 8.501

6.  Reconstructing Chromatic-Dispersion Relations and Predicting Refractive Indices Using Text Mining and Machine Learning. (Review)

Authors:  Jiuyang Zhao; Jacqueline M Cole
Journal:  J Chem Inf Model       Date:  2022-05-19       Impact factor: 6.162

7.  A database of refractive indices and dielectric constants auto-generated using ChemDataExtractor.

Authors:  Jiuyang Zhao; Jacqueline M Cole
Journal:  Sci Data       Date:  2022-05-03       Impact factor: 8.501

8.  A database of battery materials auto-generated using ChemDataExtractor.

Authors:  Shu Huang; Jacqueline M Cole
Journal:  Sci Data       Date:  2020-08-06       Impact factor: 6.444

9.  Combining Machine Learning and Computational Chemistry for Predictive Insights Into Chemical Systems.

Authors:  John A Keith; Valentin Vassilev-Galindo; Bingqing Cheng; Stefan Chmiela; Michael Gastegger; Klaus-Robert Müller; Alexandre Tkatchenko
Journal:  Chem Rev       Date:  2021-07-07       Impact factor: 60.622

10.  Machine-learned and codified synthesis parameters of oxide materials.

Authors:  Edward Kim; Kevin Huang; Alex Tomala; Sara Matthews; Emma Strubell; Adam Saunders; Andrew McCallum; Elsa Olivetti
Journal:  Sci Data       Date:  2017-09-12       Impact factor: 6.444
