Anabel Usié, Rui Alves, Francesc Solsona, Miguel Vázquez, Alfonso Valencia. Department of Basic Medical Science (CMB), University of Lleida & IRBLleida; Department of Computers and Industrial Engineering (DIEI), University of Lleida, Lleida; and Structural Biology and Biocomputing Programme, Spanish National Cancer Research Center (CNIO), Madrid, Spain.
Abstract
MOTIVATION: Chemical named entity recognition is used to automatically identify mentions of chemical compounds in text and is the basis for more elaborate information extraction. However, only a small number of applications are freely available to identify such mentions. Particularly challenging and useful is the identification of International Union of Pure and Applied Chemistry (IUPAC) chemical compounds, which, owing to the complex morphology of IUPAC names, requires more advanced techniques than the identification of brand names. RESULTS: We present CheNER, a tool for the automated identification of systematic IUPAC chemical mentions. We evaluated different systems using an established literature corpus to show that CheNER performs better in identifying IUPAC names specifically, and that it makes better use of computational resources. AVAILABILITY AND IMPLEMENTATION: http://metres.udl.cat/index.php/9-download/4-chener, http://chener.bioinfo.cnio.es/
Automated named entity recognition (NER) of chemical compounds is receiving increased attention from researchers because it can facilitate the application of information extraction to the pharmaceutical treatment of diseases and to understanding how those compounds modulate gene/protein activities. Chemical NER draws on the experience gained in gene and protein NER (Smith, 2008), but differs from it in three ways.
First, catalogs of the names and compositions of chemical compounds have traditionally been less accessible. Fortunately, freely available chemical databases such as PubChem (Li ) or DrugBank (Wishart ) are helping to correct this issue. This makes it possible to do NER of common drug names such as ‘Aspirin’ or ‘Acetone’ by using a dictionary-based approach.
Second, the complexity and variability of the morphological structure of systematic IUPAC (International Union of Pure and Applied Chemistry) chemical names (McNaught and Wilkinson, 1997) make it impossible to create a finite dictionary of such names. This poses the main challenge for NER of chemical names (Vazquez ). IUPAC names can be simple words, or they can contain different punctuation marks, sequences of numbers separated by commas and so forth. They can also be combined in different forms (e.g. ‘18-bromo-12-butyl-11-chloro-4,8-diethyl-5-hydroxy-15-methoxy’), making it impossible to enumerate them all. This means that NER of such names cannot be done by dictionary matching and requires alternative approaches.
Third, systematic nomenclatures of chemicals, like IUPAC, can be used directly to unambiguously derive the chemical structure of a compound.
The number of freely available applications for NER of common and systematic names of chemical compounds is still small, and their usability, efficiency and accuracy are far from perfect.
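The limits of dictionary matching described above can be illustrated with a toy example. The sketch below is not part of CheNER; the dictionary contents, the function name `dictionary_ner` and the sample sentence are assumptions chosen only for illustration. It finds common names that are listed in the dictionary but has no way of recognizing a systematic name that was never enumerated.

```python
import re

# Hypothetical toy dictionary of common drug names (illustrative only).
COMMON_NAMES = {"aspirin", "acetone"}


def dictionary_ner(text, dictionary=COMMON_NAMES):
    """Return (start, end, surface form) for every alphabetic word found in the dictionary."""
    hits = []
    for match in re.finditer(r"[A-Za-z]+", text):
        if match.group().lower() in dictionary:
            hits.append((match.start(), match.end(), match.group()))
    return hits


if __name__ == "__main__":
    sentence = "Aspirin was compared with 2-acetyloxybenzoic acid in acetone."
    # The two common names are found; the systematic name is missed because
    # it is not (and, in general, cannot be) listed in a finite dictionary.
    print(dictionary_ner(sentence))
```

Adding the missed name to the dictionary would fix this one sentence only; any new combination of substituents and locants would be missed again, which is why a sequence-labeling approach is used instead.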
To help alleviate these problems, in this work we present and benchmark CheNER, a machine learning application based on conditional random fields (CRFs) that performs NER of IUPAC chemical entities with improved performance over comparable tools.
2 METHODS
CheNER uses linear CRFs to predict the locations of IUPAC entity mentions in text. CRFs are a probabilistic framework for the labeling or segmentation of sequential data (Lafferty ).
The training and benchmarking of the application were done using the corpora provided by Kolářik and Klinger (Klinger ; Kolářik ). The corpora are divided into a training corpus (463 abstracts, 5072 annotated entities), a Medline test corpus with a small number of entities (1000 abstracts, 165 annotated entities) and an evaluation corpus with a large number of entities (100 abstracts, 1310 annotated entities). All corpora contain annotated chemical entities written using the IUPAC nomenclature and other types of chemical names. CheNER’s CRF was trained on the training corpus. Its performance was subsequently evaluated independently on both the Medline test corpus and the evaluation corpus.
In training our CRF, we defined a set of features and tested different combinations of them, together with two types of tokenization (A: by spaces; B: by punctuation marks), different orders of CRF (1 or 2) and different sizes of offset conjunction, or sliding window (n = 0 or n = 1). The offset conjunction creates additional features for a token by conjoining its features with those of the n surrounding tokens. We then selected the best combination, indicated by the highest F-score obtained in cross-validation over the training set, as the model to use in the evaluation. The selected model achieves an F-score of 80.20% (precision: 82.84%; recall: 77.74%) and uses a second-order CRF, an offset conjunction of 1, tokenization type A and a particular set of features described in the Supplementary Materials. To mark chemical mentions and establish borders between tokens during training, we used the IOB labeling scheme (Vazquez ). Details about the tested sets of features, the training and evaluation corpora, the training process, the modeling assumptions, and model performance and selection are described in Sections 1–3 of the Supplementary Materials.
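The preprocessing choices described above (tokenization types A and B, IOB labeling and the offset conjunction) can be sketched as follows. This is a minimal illustration and not CheNER’s implementation: the function names, the three toy features and the `@offset` naming of neighbouring features are assumptions, and the offset conjunction is shown in one simplified reading in which each token also receives the features of its neighbours within the window, tagged with their relative position.

```python
import re


def tokenize_by_spaces(text):
    """Tokenization type A: split on whitespace only."""
    return text.split()


def tokenize_by_punctuation(text):
    """Tokenization type B: also emit every punctuation mark as its own token."""
    return re.findall(r"\w+|[^\w\s]", text)


def iob_labels(tokens, entity_tokens):
    """Label tokens B (begin), I (inside) or O (outside) for one entity given as a token list."""
    labels = ["O"] * len(tokens)
    n = len(entity_tokens)
    if n == 0:
        return labels
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == entity_tokens:
            labels[i] = "B"
            for j in range(i + 1, i + n):
                labels[j] = "I"
    return labels


def token_features(token):
    """Toy feature set (an assumption, not the feature set used by CheNER)."""
    feats = {"lower=" + token.lower()}
    if any(ch.isdigit() for ch in token):
        feats.add("has_digit")
    if "-" in token:
        feats.add("has_hyphen")
    return feats


def offset_conjunction(per_token_features, window):
    """Add to each token the features of its neighbours within `window` positions."""
    result = []
    for i, feats in enumerate(per_token_features):
        combined = set(feats)
        for offset in range(-window, window + 1):
            j = i + offset
            if offset != 0 and 0 <= j < len(per_token_features):
                combined |= {f"{f}@{offset:+d}" for f in per_token_features[j]}
        result.append(combined)
    return result


if __name__ == "__main__":
    text = "treated with 2-acetyloxybenzoic acid overnight"
    tokens_a = tokenize_by_spaces(text)
    tokens_b = tokenize_by_punctuation(text)
    print(tokens_a)
    print(tokens_b)
    print(iob_labels(tokens_a, tokenize_by_spaces("2-acetyloxybenzoic acid")))
    features = offset_conjunction([token_features(t) for t in tokens_a], window=1)
    print(sorted(features[2]))
```

With type A the systematic name stays in two tokens, whereas type B breaks it at the hyphen, so the same entity spans more tokens and yields a longer B/I label sequence. A window of 0 leaves the per-token features unchanged.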
3 RESULTS
3.1 Comparative performance for NER of chemical names
The predictive capability of CheNER for IUPAC names was evaluated on the Medline test corpus and the evaluation corpus by comparing the system output with a gold standard in terms of precision (p), recall (r) and F-score (F).
There are, to our knowledge, only two other freely available tools for chemical NER: ChemSpot (Rocktäschel ) and OSCAR4 (Jessop ). To compare CheNER’s performance with that of those tools, we used the three applications to independently annotate both corpora and compared the results. Our analysis shows that CheNER outperforms the other two applications in the experiments regarding IUPAC names alone (see Fig. 1) because it was trained specifically for them. Note that OSCAR4 and ChemSpot do not differentiate between IUPAC and other types of chemical entities, so they detect entities that, albeit chemical, are not IUPAC names and register as false positives. To make the three methods comparable, we ignore non-IUPAC entities that are annotated in the corpora when evaluating performance. Unfortunately, one of the two corpora does not annotate non-IUPAC entities, so results on that corpus can only be compared in terms of recall. We find that CheNER performs better than OSCAR4 and ChemSpot in identifying IUPAC names. Details are given in Section 4 of the Supplementary Materials.
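The evaluation described above can be sketched as follows, assuming exact matching of entity spans against the gold standard. The span representation `(doc_id, start, end)`, the function names and the way non-IUPAC annotations are discarded are assumptions made for illustration; the exact matching criteria used in the benchmark are described in the Supplementary Materials and may differ.

```python
def f_score(p, r):
    """Harmonic mean of precision and recall: F = 2pr / (p + r)."""
    return 2 * p * r / (p + r) if p + r else 0.0


def evaluate(predicted, gold):
    """Exact-match precision, recall and F-score over sets of (doc_id, start, end) spans."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    p = true_positives / len(predicted) if predicted else 0.0
    r = true_positives / len(gold) if gold else 0.0
    return p, r, f_score(p, r)


def evaluate_iupac_only(predicted, gold_iupac, gold_other):
    """Drop predictions that match annotated non-IUPAC entities before scoring.

    This is one possible reading of "ignoring non-IUPAC entities": a tool that
    tags every chemical mention is not penalized for hits on annotated
    non-IUPAC names, but still is for spans that match no annotation at all.
    """
    kept = set(predicted) - set(gold_other)
    return evaluate(kept, gold_iupac)


if __name__ == "__main__":
    gold_iupac = {("d1", 0, 10), ("d1", 20, 35)}
    gold_other = {("d1", 40, 47)}              # e.g. an annotated trivial name
    predicted = {("d1", 0, 10), ("d1", 40, 47), ("d1", 50, 55)}
    print(evaluate(predicted, gold_iupac))
    print(evaluate_iupac_only(predicted, gold_iupac, gold_other))
    # Harmonic mean of the cross-validation precision and recall reported in Methods.
    print(round(f_score(0.8284, 0.7774), 4))
```

When a corpus carries no non-IUPAC annotations, `gold_other` is empty and every non-IUPAC chemical found by a general-purpose tool counts against its precision, which is why only recall is comparable in that setting.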
Fig. 1.
Predictive capability of the different tools in identifying IUPAC entities over the two test corpora (panels A and B). We measure the ability of the three tools to specifically identify IUPAC chemical entities in the two corpora.
Given that CheNER has been trained in the specialized task of recognizing IUPAC names, it is not surprising that when applied to non-IUPAC names it does not perform at the levels of other systems (see Section 4 of Supplementary Materials).
3.2 Comparative use of computational resources
We also evaluated how efficiently ChemSpot, OSCAR4 and CheNER use computing resources. We found that CheNER requires less physical memory: it runs on computers that have <3 GB of RAM, compared with a minimum of 3 and 12 GB of RAM required by OSCAR4 and ChemSpot, respectively (see Supplementary Figs S3 and S4 and Section 4 of the Supplementary Materials for details).
4 DISCUSSION
Because IUPAC names are the standard in important types of documents, such as patents, and the chemical structure is often derivable from the mention itself, it is important to have an application specifically devised for their identification. Given the potentially infinite number of IUPAC entities, it is not feasible to develop a dictionary-based approach to identify them, and natural language processing methods are more suitable for the task. Thus, we developed CheNER, an NER approach for finding IUPAC names in text, using CRFs. We demonstrate that CheNER annotates IUPAC names in documents with a better F-score than ChemSpot and OSCAR4. CheNER is the only one of these tools developed specifically to identify such names, whereas ChemSpot and OSCAR4 do not differentiate between entity types.
We also show that CheNER needs less memory and CPU than the other tools to perform the same tasks. In addition, CheNER is self-contained and requires only an installed Java runtime, which makes it easier to integrate into other systems.
Authors: Roman Klinger; Corinna Kolárik; Juliane Fluck; Martin Hofmann-Apitius; Christoph M Friedrich Journal: Bioinformatics Date: 2008-07-01 Impact factor: 6.937
Authors: David S Wishart; Craig Knox; An Chi Guo; Dean Cheng; Savita Shrivastava; Dan Tzur; Bijaya Gautam; Murtaza Hassanali Journal: Nucleic Acids Res Date: 2007-11-29 Impact factor: 16.971
Authors: Larry Smith; Lorraine K Tanabe; Rie Johnson nee Ando; Cheng-Ju Kuo; I-Fang Chung; Chun-Nan Hsu; Yu-Shi Lin; Roman Klinger; Christoph M Friedrich; Kuzman Ganchev; Manabu Torii; Hongfang Liu; Barry Haddow; Craig A Struble; Richard J Povinelli; Andreas Vlachos; William A Baumgartner; Lawrence Hunter; Bob Carpenter; Richard Tzong-Han Tsai; Hong-Jie Dai; Feng Liu; Yifei Chen; Chengjie Sun; Sophia Katrenko; Pieter Adriaans; Christian Blaschke; Rafael Torres; Mariana Neves; Preslav Nakov; Anna Divoli; Manuel Maña-López; Jacinto Mata; W John Wilbur Journal: Genome Biol Date: 2008-09-01 Impact factor: 13.583
Authors: Heng-Yi Wu; Deshun Lu; Mustafa Hyder; Shijun Zhang; Sara K Quinney; Zeruesenay Desta; Lang Li Journal: CPT Pharmacometrics Syst Pharmacol Date: 2018-09-29