
LeadMine: a grammar and dictionary driven approach to entity recognition.

Daniel M Lowe, Roger A Sayle.

Abstract

BACKGROUND: Chemical entity recognition has traditionally been performed by machine learning approaches. Here we describe an approach using grammars and dictionaries. This approach has the advantage that the entities found can be directly related to a given grammar or dictionary, which allows the type of an entity to be known and, if an entity is misannotated, indicates which resource should be corrected. As recognition is driven by what is expected, spelling errors can be corrected when they occur. Correcting such errors is highly useful when attempting to look up an entity in a database or, in the case of chemical names, when converting them to structures.
RESULTS: Our system uses a mixture of expertly curated grammars and dictionaries, as well as dictionaries automatically derived from public resources. We show that the heuristics developed to filter our dictionary of trivial chemical names (from PubChem) yield a better-performing dictionary than the previously published Jochem dictionary. Our final system performs post-processing steps to modify the boundaries of entities and to detect abbreviations. These steps are shown to significantly improve performance (2.6% and 4.0% F1-score, respectively). Our complete system, with incremental post-BioCreative workshop improvements, achieves 89.9% precision and 85.4% recall (87.6% F1-score) on the CHEMDNER test set.
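As a quick check on the reported figures, the F1-score is the harmonic mean of precision and recall. The short sketch below recomputes it from the precision and recall quoted above; the function name `f1_score` is ours for illustration and is not part of LeadMine.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in the same units)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when nothing is found or correct
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the CHEMDNER test set, in percent.
reported_f1 = f1_score(89.9, 85.4)
print(round(reported_f1, 1))  # prints 87.6
```

The result agrees with the 87.6% F1-score stated in the abstract.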
CONCLUSIONS: Grammar and dictionary approaches can produce results at least as good as the current state of the art in machine learning approaches. While machine learning approaches are commonly thought of as "black box" systems, our approach directly links the output entities to the input dictionaries and grammars. Our approach also allows correction of errors in detected entities, which can assist with entity resolution.
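The idea that each detected entity can be traced back to the resource that produced it, and that near-miss spellings can be corrected against that resource, can be illustrated with a toy example. This is only a minimal sketch under our own assumptions: the dictionaries, the `recognize` function, and the similarity cutoff are hypothetical, and fuzzy matching with Python's `difflib` stands in for whatever correction mechanism LeadMine actually uses.

```python
import difflib
import re

# Hypothetical toy dictionaries; real resources would be far larger.
DICTIONARIES = {
    "trivial_names": ["aspirin", "caffeine", "ibuprofen"],
    "elements": ["chlorine", "sodium"],
}


def recognize(text, cutoff=0.85):
    """Return (surface form, corrected form, source dictionary) triples."""
    hits = []
    for token in re.findall(r"[A-Za-z]+", text):
        word = token.lower()
        for source, terms in DICTIONARIES.items():
            # Accept the closest dictionary term if it is similar enough.
            match = difflib.get_close_matches(word, terms, n=1, cutoff=cutoff)
            if match:
                hits.append((token, match[0], source))
                break  # first dictionary that matches wins
    return hits


for surface, corrected, source in recognize("Patients received asprin and caffeine."):
    print(f"{surface!r} -> {corrected!r} (from {source})")
```

Because every hit carries the name of its source dictionary, a false positive points directly at the entry that should be removed or refined, which is the traceability property the abstract contrasts with "black box" systems.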

Keywords:  BioCreative IV; CHEMDNER; LeadMine; chemical entity recognition; dictionaries; grammars

Year:  2015        PMID: 25810776      PMCID: PMC4331695          DOI: 10.1186/1758-2946-7-S1-S5

Source DB:  PubMed          Journal:  J Cheminform        ISSN: 1758-2946            Impact factor:   5.514


  10 in total

1.  A simple algorithm for identifying abbreviation definitions in biomedical text.

Authors:  Ariel S Schwartz; Marti A Hearst
Journal:  Pac Symp Biocomput       Date:  2003

2.  Improved chemical text mining of patents with infinite dictionaries and automatic spelling correction.

Authors:  Roger Sayle; Paul Hongxing Xie; Sorel Muresan
Journal:  J Chem Inf Model       Date:  2011-12-28       Impact factor: 4.956

3.  ChemSpot: a hybrid system for chemical named entity recognition.

Authors:  Tim Rocktäschel; Michael Weidlich; Ulf Leser
Journal:  Bioinformatics       Date:  2012-04-12       Impact factor: 6.937

4.  A dictionary to identify small molecules and drugs in free text.

Authors:  Kristina M Hettne; Rob H Stierum; Martijn J Schuemie; Peter J M Hendriksen; Bob J A Schijvenaars; Erik M van Mulligen; Jos Kleinjans; Jan A Kors
Journal:  Bioinformatics       Date:  2009-09-16       Impact factor: 6.937

5.  LeadMine: a grammar and dictionary driven approach to entity recognition.

Authors:  Daniel M Lowe; Roger A Sayle
Journal:  J Cheminform       Date:  2015-01-19       Impact factor: 5.514

6.  Text Mining for Drugs and Chemical Compounds: Methods, Tools and Applications. (Review)

Authors:  Miguel Vazquez; Martin Krallinger; Florian Leitner; Alfonso Valencia
Journal:  Mol Inform       Date:  2011-07-12       Impact factor: 3.353

7.  Detection of IUPAC and IUPAC-like chemical names.

Authors:  Roman Klinger; Corinna Kolárik; Juliane Fluck; Martin Hofmann-Apitius; Christoph M Friedrich
Journal:  Bioinformatics       Date:  2008-07-01       Impact factor: 6.937

8.  OSCAR4: a flexible architecture for chemical text-mining.

Authors:  David M Jessop; Sam E Adams; Egon L Willighagen; Lezan Hawizy; Peter Murray-Rust
Journal:  J Cheminform       Date:  2011-10-14       Impact factor: 5.514

9.  The CHEMDNER corpus of chemicals and drugs and its annotation principles.

Authors:  Martin Krallinger; Obdulia Rabal; Florian Leitner; Miguel Vazquez; David Salgado; Zhiyong Lu; Robert Leaman; Yanan Lu; Donghong Ji; Daniel M Lowe; Roger A Sayle; Riza Theresa Batista-Navarro; Rafal Rak; Torsten Huber; Tim Rocktäschel; Sérgio Matos; David Campos; Buzhou Tang; Hua Xu; Tsendsuren Munkhdalai; Keun Ho Ryu; S V Ramanan; Senthil Nathan; Slavko Žitnik; Marko Bajec; Lutz Weber; Matthias Irmer; Saber A Akhondi; Jan A Kors; Shuo Xu; Xin An; Utpal Kumar Sikdar; Asif Ekbal; Masaharu Yoshioka; Thaer M Dieb; Miji Choi; Karin Verspoor; Madian Khabsa; C Lee Giles; Hongfang Liu; Komandur Elayavilli Ravikumar; Andre Lamurias; Francisco M Couto; Hong-Jie Dai; Richard Tzong-Han Tsai; Caglar Ata; Tolga Can; Anabel Usié; Rui Alves; Isabel Segura-Bedmar; Paloma Martínez; Julen Oyarzabal; Alfonso Valencia
Journal:  J Cheminform       Date:  2015-01-19       Impact factor: 5.514

10.  ChEBI: a database and ontology for chemical entities of biological interest.

Authors:  Kirill Degtyarenko; Paula de Matos; Marcus Ennis; Janna Hastings; Martin Zbinden; Alan McNaught; Rafael Alcántara; Michael Darsow; Mickaël Guedj; Michael Ashburner
Journal:  Nucleic Acids Res       Date:  2007-10-11       Impact factor: 16.971

  22 in total

1.  Putting hands to rest: efficient deep CNN-RNN architecture for chemical named entity recognition with no hand-crafted rules.

Authors:  Ilia Korvigo; Maxim Holmatov; Anatolii Zaikovskii; Mikhail Skoblov
Journal:  J Cheminform       Date:  2018-05-23       Impact factor: 5.514

2.  Recognition of chemical entities: combining dictionary-based and grammar-based approaches.

Authors:  Saber A Akhondi; Kristina M Hettne; Eelke van der Horst; Erik M van Mulligen; Jan A Kors
Journal:  J Cheminform       Date:  2015-01-19       Impact factor: 5.514

3.  LeadMine: a grammar and dictionary driven approach to entity recognition.

Authors:  Daniel M Lowe; Roger A Sayle
Journal:  J Cheminform       Date:  2015-01-19       Impact factor: 5.514

4.  Extracting Drug Names and Associated Attributes From Discharge Summaries: Text Mining Study.

Authors:  Ghada Alfattni; Maksim Belousov; Niels Peek; Goran Nenadic
Journal:  JMIR Med Inform       Date:  2021-05-05

5.  CheNER: a tool for the identification of chemical entities and their classes in biomedical literature.

Authors:  Anabel Usié; Joaquim Cruz; Jorge Comas; Francesc Solsona; Rui Alves
Journal:  J Cheminform       Date:  2015-01-19       Impact factor: 5.514

6.  CHEMDNER: The drugs and chemical names extraction challenge.

Authors:  Martin Krallinger; Florian Leitner; Obdulia Rabal; Miguel Vazquez; Julen Oyarzabal; Alfonso Valencia
Journal:  J Cheminform       Date:  2015-01-19       Impact factor: 5.514

7.  Annotated chemical patent corpus: a gold standard for text mining.

Authors:  Saber A Akhondi; Alexander G Klenner; Christian Tyrchan; Anil K Manchala; Kiran Boppana; Daniel Lowe; Marc Zimmermann; Sarma A R P Jagarlapudi; Roger Sayle; Jan A Kors; Sorel Muresan
Journal:  PLoS One       Date:  2014-09-30       Impact factor: 3.240

8.  CD-REST: a system for extracting chemical-induced disease relation in literature.

Authors:  Jun Xu; Yonghui Wu; Yaoyun Zhang; Jingqi Wang; Hee-Jin Lee; Hua Xu
Journal:  Database (Oxford)       Date:  2016-03-25       Impact factor: 3.451

9.  Chemical named entities recognition: a review on approaches and applications. (Review)

Authors:  Safaa Eltyeb; Naomie Salim
Journal:  J Cheminform       Date:  2014-04-28       Impact factor: 5.514

10.  Chemical entity recognition in patents by combining dictionary-based and statistical approaches.

Authors:  Saber A Akhondi; Ewoud Pons; Zubair Afzal; Herman van Haagen; Benedikt F H Becker; Kristina M Hettne; Erik M van Mulligen; Jan A Kors
Journal:  Database (Oxford)       Date:  2016-05-02       Impact factor: 3.451
