Recognition of chemical entities: combining dictionary-based and grammar-based approaches.

Saber A Akhondi, Kristina M Hettne, Eelke van der Horst, Erik M van Mulligen, Jan A Kors.

Abstract

BACKGROUND: The past decade has seen an upsurge in the number of publications in chemistry. The growing volume of available documents makes it increasingly hard to extract relevant new information from such unstructured texts. The BioCreative CHEMDNER challenge invites the development of systems for the automatic recognition of chemicals in text (CEM task) and for ranking the recognized compounds at the document level (CDI task). We investigated an ensemble approach in which dictionary-based named entity recognition is used along with grammar-based recognizers to extract compounds from text. We assessed the performance of ten different commercial and publicly available lexical resources using an open source indexing system (Peregrine), in combination with three different chemical compound recognizers and a set of regular expressions to recognize chemical database identifiers. The effects of different stop-word lists, case-sensitive matching, and use of chunking information were also investigated. We focused on lexical resources that provide chemical structure information. To rank the different compounds found in a text, we used a term confidence score based on the normalized ratio of the term frequencies in chemical and non-chemical journals.
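The abstract does not give the formula behind the term confidence score, so the following is only a minimal sketch of one plausible reading: a term's relative frequency in chemical journals divided by the sum of its relative frequencies in chemical and non-chemical journals. The function names, the 0.5 fallback for unseen terms, and the toy corpora are all assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Map each term to its share of all term occurrences in a corpus."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def term_confidence(term, chem_freq, nonchem_freq):
    """Assumed form: f_chem / (f_chem + f_nonchem), a value in [0, 1].

    Returns 0.5 for a term seen in neither corpus (no evidence either way).
    """
    f_chem = chem_freq.get(term, 0.0)
    f_non = nonchem_freq.get(term, 0.0)
    if f_chem + f_non == 0.0:
        return 0.5
    return f_chem / (f_chem + f_non)


# Toy corpora, invented for illustration only.
chem_tokens = ["benzene", "lead", "benzene", "acetone", "lead", "benzene"]
nonchem_tokens = ["lead", "lead", "lead", "team", "project", "lead"]

chem_freq = relative_frequencies(chem_tokens)
nonchem_freq = relative_frequencies(nonchem_tokens)

# Rank candidate terms from most to least likely to be a chemical mention.
ranked = sorted(
    ["lead", "benzene", "acetone"],
    key=lambda t: term_confidence(t, chem_freq, nonchem_freq),
    reverse=True,
)
print(ranked)
```

Under this reading, an ambiguous word such as "lead", which is common outside chemistry, receives a low score and ranks below terms that occur almost exclusively in chemical journals.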
RESULTS: The use of stop-word lists greatly improved the performance of the dictionary-based recognition, but there was no additional benefit from using chunking information. A combination of ChEBI and HMDB as lexical resources, the LeadMine tool for grammar-based recognition, and the regular expressions outperformed any of the individual systems. On the test set, the F-scores were 77.8% (recall 71.2%, precision 85.8%) for the CEM task and 77.6% (recall 71.7%, precision 84.6%) for the CDI task. Missed terms were mainly due to tokenization issues, poor recognition of formulas, and term conjunctions.
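The reported F-scores can be checked against the stated precision and recall. The sketch below assumes the balanced F-score (F1, the harmonic mean of precision and recall), which the abstract does not state explicitly but which reproduces both reported values.

```python
def f_score(precision, recall):
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in percent) as reported for the test set.
cem = f_score(85.8, 71.2)
cdi = f_score(84.6, 71.7)

print(f"CEM F-score: {cem:.1f}%")  # CEM F-score: 77.8%
print(f"CDI F-score: {cdi:.1f}%")  # CDI F-score: 77.6%
```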
CONCLUSIONS: We developed an ensemble system that combines dictionary-based and grammar-based approaches for chemical named entity recognition, outperforming any of the individual systems that we considered. The system is able to provide structure information for most of the compounds that are found. Improved tokenization and better recognition of specific entity types are likely to further improve system performance.
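As a rough illustration of the ensemble idea, the sketch below merges the output of a toy dictionary matcher (with a stop-word filter) and a regular expression for one kind of database identifier. It does not use Peregrine, LeadMine, ChEBI, or HMDB. The dictionary, the stop-word list, the choice of CAS Registry Numbers as the identifier type, and the longest-span merge rule are all assumptions chosen for the example.

```python
import re

# Hypothetical toy dictionary and stop-word list, for illustration only.
DICTIONARY = {"aspirin", "lead", "acetic acid"}
STOP_WORDS = {"lead"}  # ambiguous term suppressed to protect precision

# Shape of a CAS Registry Number: 2-7 digits, 2 digits, 1 check digit.
CAS_PATTERN = re.compile(r"\b\d{2,7}-\d{2}-\d\b")


def dictionary_matches(text):
    """Case-insensitive whole-word lookup of dictionary terms."""
    lowered = text.lower()
    spans = []
    for term in DICTIONARY - STOP_WORDS:
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", lowered):
            spans.append((m.start(), m.end(), "dictionary"))
    return spans


def identifier_matches(text):
    """Find identifier-shaped strings with the regular expression."""
    return [(m.start(), m.end(), "regex") for m in CAS_PATTERN.finditer(text)]


def merge(*span_lists):
    """Union of all spans; where two overlap, keep the longer one."""
    candidates = sorted(
        (s for spans in span_lists for s in spans),
        key=lambda s: (-(s[1] - s[0]), s[0]),
    )
    kept = []
    for start, end, source in candidates:
        if all(end <= k[0] or start >= k[1] for k in kept):
            kept.append((start, end, source))
    return sorted(kept)


text = "Aspirin (CAS 50-78-2) and acetic acid lead the list."
spans = merge(dictionary_matches(text), identifier_matches(text))
found = [(text[s:e], src) for s, e, src in spans]
print(found)
```

Here the stop-word filter keeps the verb "lead" from being tagged as a chemical, while the regular expression contributes an identifier that the dictionary does not contain.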

Keywords: BioCreative; CHEMDNER; Chemical compounds; Chemical databases; Chemical dictionaries; Chemical identifiers; Chemical structure; Drugs; MOL files; Named entity recognition

Year: 2015
PMID: 25810767
PMCID: PMC4331686
DOI: 10.1186/1758-2946-7-S1-S10
Source DB: PubMed
Journal: J Cheminform
ISSN: 1758-2946
Impact factor: 5.514


