Chemical name extraction based on automatic training data generation and rich feature set.

Su Yan, W Scott Spangler, Ying Chen.

Abstract

Automating the extraction of chemical names from text has significant value for biomedical and life science research. A major barrier in this task is the difficulty of obtaining a sizable, good-quality data set to train a reliable entity extraction model. Another difficulty is selecting informative features of chemical names, since this requires comprehensive domain knowledge of chemical nomenclature. Leveraging random text generation techniques, we explore the idea of automatically creating training sets for the task of chemical name extraction. Assuming the availability of an incomplete list of chemical names, called a dictionary, we are able to generate well-controlled, random, yet realistic chemical-like training documents. We statistically analyze the construction of chemical names based on the incomplete dictionary and propose a series of new features without relying on any domain knowledge. Compared to state-of-the-art models learned from manually labeled data and domain knowledge, our solution shows better or comparable results in annotating real-world data with less human effort. Moreover, we report an interesting observation about the language of chemical names: both the structural and semantic components of chemical names follow a Zipfian distribution, as do the words of many natural languages.
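The abstract does not say how the training documents are generated, so the following is only a minimal sketch of the general idea of dictionary-driven training data generation, not the authors' method. It assumes a hypothetical set of sentence templates with `{CHEM}` slots, a toy dictionary, and BIO-style labels (`B-CHEM`, `I-CHEM`, `O`); all of these names and inputs are illustrative assumptions.

```python
import random


def generate_labeled_sentence(templates, dictionary, rng):
    """Fill the {CHEM} slots of one random template with dictionary names.

    Returns parallel lists of tokens and BIO tags. The template format and
    the tag names are assumptions made for this sketch.
    """
    template = rng.choice(templates)
    tokens, tags = [], []
    for word in template.split():
        if word == "{CHEM}":
            # A dictionary entry may span several whitespace-separated tokens.
            name_tokens = rng.choice(dictionary).split()
            tokens.extend(name_tokens)
            tags.extend(["B-CHEM"] + ["I-CHEM"] * (len(name_tokens) - 1))
        else:
            tokens.append(word)
            tags.append("O")
    return tokens, tags


# Hypothetical toy inputs: an incomplete dictionary and two templates.
dictionary = ["acetic acid", "benzene", "sodium chloride"]
templates = [
    "The sample contained {CHEM} and {CHEM} .",
    "We dissolved {CHEM} in water .",
]

rng = random.Random(0)  # fixed seed so the generated corpus is reproducible
corpus = [generate_labeled_sentence(templates, dictionary, rng) for _ in range(5)]

for tokens, tags in corpus:
    print(" ".join(f"{tok}/{tag}" for tok, tag in zip(tokens, tags)))
```

Because every inserted name comes from the dictionary, the labels are known by construction and no manual annotation is needed; a sequence labeler could then be trained on `corpus`.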

Year:  2013        PMID: 24384710     DOI: 10.1109/TCBB.2013.101

Source DB:  PubMed          Journal:  IEEE/ACM Trans Comput Biol Bioinform        ISSN: 1545-5963            Impact factor:   3.710


  3 in total

1.  A hybrid approach for automated mutation annotation of the extended human mutation landscape in scientific literature.

Authors:  Antonio Jimeno Yepes; Andrew MacKinlay; Natalie Gunn; Christine Schieber; Noel Faux; Matthew Downton; Benjamin Goudey; Richard L Martin
Journal:  AMIA Annu Symp Proc       Date:  2018-12-05

2.  Chemical named entity recognition in patents by domain knowledge and unsupervised feature learning.

Authors:  Yaoyun Zhang; Jun Xu; Hui Chen; Jingqi Wang; Yonghui Wu; Manu Prakasam; Hua Xu
Journal:  Database (Oxford)       Date:  2016-04-17       Impact factor: 3.451

3.  Ratiometric Decoding of Pheromones for a Biomimetic Infochemical Communication System.

Authors:  Guangfen Wei; Sanju Thomas; Marina Cole; Zoltán Rácz; Julian W Gardner
Journal:  Sensors (Basel)       Date:  2017-10-30       Impact factor: 3.576

