Martin Krallinger, Obdulia Rabal, Florian Leitner, Miguel Vazquez, David Salgado, Zhiyong Lu, Robert Leaman, Yanan Lu, Donghong Ji, Daniel M Lowe, Roger A Sayle, Riza Theresa Batista-Navarro, Rafal Rak, Torsten Huber, Tim Rocktäschel, Sérgio Matos, David Campos, Buzhou Tang, Hua Xu, Tsendsuren Munkhdalai, Keun Ho Ryu, S V Ramanan, Senthil Nathan, Slavko Žitnik, Marko Bajec, Lutz Weber, Matthias Irmer, Saber A Akhondi, Jan A Kors, Shuo Xu, Xin An, Utpal Kumar Sikdar, Asif Ekbal, Masaharu Yoshioka, Thaer M Dieb, Miji Choi, Karin Verspoor, Madian Khabsa, C Lee Giles, Hongfang Liu, Komandur Elayavilli Ravikumar, Andre Lamurias, Francisco M Couto, Hong-Jie Dai, Richard Tzong-Han Tsai, Caglar Ata, Tolga Can, Anabel Usié, Rui Alves, Isabel Segura-Bedmar, Paloma Martínez, Julen Oyarzabal, Alfonso Valencia.
Abstract
The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one of its key steps. When developing supervised named entity recognition (NER) systems, the availability of a large, manually annotated text corpus is desirable. Furthermore, large corpora permit the robust evaluation and comparison of different approaches that detect chemicals in documents. We present the CHEMDNER corpus, a collection of 10,000 PubMed abstracts that contain a total of 84,355 chemical entity mentions labeled manually by expert chemistry literature curators, following annotation guidelines specifically defined for this task. The abstracts of the CHEMDNER corpus were selected to be representative of all major chemical disciplines. Each of the chemical entity mentions was manually labeled according to its structure-associated chemical entity mention (SACEM) class: abbreviation, family, formula, identifier, multiple, systematic and trivial. The difficulty and consistency of tagging chemicals in text were measured using an agreement study between annotators, which yielded a percentage agreement of 91%. For a subset of the CHEMDNER corpus (the test set of 3,000 abstracts) we provide not only the Gold Standard manual annotations, but also the mentions automatically detected by the 26 teams that participated in the BioCreative IV CHEMDNER chemical mention recognition task. In addition, we release the CHEMDNER silver standard corpus of automatically extracted mentions from 17,000 randomly selected PubMed abstracts. A version of the CHEMDNER corpus in the BioC format has been generated as well. We propose a standard for the required minimum information about entity annotations for the construction of domain-specific corpora on chemical and drug entities. The CHEMDNER corpus and annotation guidelines are available at: http://www.biocreative.org/resources/biocreative-iv/chemdner-corpus/.
Keywords: BioCreative; ChemNLP; chemical entity recognition; chemical indexing; machine learning; named entity recognition; text mining
Year: 2015 PMID: 25810773 PMCID: PMC4331692 DOI: 10.1186/1758-2946-7-S1-S2
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
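The abstract reports a percentage agreement of 91% between annotators but does not define the measure. The following is a minimal sketch of one common way such a figure could be computed, assuming that each annotator's mentions are represented as (start, end) character offsets and that agreement is the share of exactly matching spans among all spans marked by either annotator. The function name `percent_agreement` and the example offsets are hypothetical and not taken from the CHEMDNER study.

```python
def percent_agreement(annotator_a, annotator_b):
    """Percentage of spans marked identically by both annotators.

    Assumption: a mention is a (start, end) offset pair, and only exact
    matches count as agreement. The CHEMDNER study may have used a
    different definition.
    """
    a, b = set(annotator_a), set(annotator_b)
    union = a | b
    if not union:
        return 100.0  # neither annotator marked anything
    return 100.0 * len(a & b) / len(union)


# Hypothetical offsets for one abstract: the third mention differs by one character.
spans_a = [(0, 7), (15, 22), (40, 48)]
spans_b = [(0, 7), (15, 22), (41, 48)]

agreement = percent_agreement(spans_a, spans_b)
print(f"{agreement:.1f}%")
```

Exact-match scoring is strict: a one-character boundary difference counts as two disagreements, so a relaxed (overlap-based) variant would give a higher value on the same data.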
Figure 1. CHEMDNER chemical entity mention classification chart and examples.
Figure 2. Left side: Overview of the manual CHEMDNER corpus annotation process. Right side and bottom: Annotation examples for the Chemical Document Indexing (CDI) and Chemical Entity Mention (CEM) task.
CHEMDNER corpus overview.
| | Training set | Development set | Test set | Entire corpus |
|---|---|---|---|---|
| Abstracts | 3,500 | 3,500 | 3,000 | 10,000 |
| Nr. characters | 4,883,753 | 4,864,558 | 4,199,068 | 13,947,379 |
| Nr. tokens | 770,855 | 766,331 | 662,571 | 2,199,757 |
| Abstracts with SACEM | 2,916 | 2,907 | 2,478 | 8,301 |
| Nr. mentions | 29,478 | 29,526 | 25,351 | 84,355 |
| Nr. chemicals | 8,520 | 8,677 | 7,563 | 19,805 |
| Nr. journals | 193 | 188 | 188 | 203 |
| TRIVIAL | 8,832 | 8,970 | 7,808 | 25,610 |
| SYSTEMATIC | 6,656 | 6,816 | 5,666 | 19,138 |
| ABBREVIATION | 4,538 | 4,521 | 4,059 | 13,118 |
| FORMULA | 4,448 | 4,137 | 3,443 | 12,028 |
| FAMILY | 4,090 | 4,223 | 3,622 | 11,935 |
| IDENTIFIER | 672 | 639 | 513 | 1,824 |
| MULTIPLE | 202 | 188 | 199 | 589 |
| NO CLASS | 40 | 32 | 41 | 113 |
This table summarizes the CHEMDNER corpus: the number of manually revised abstracts (Abstracts) and their total size in characters and tokens, the number of abstracts containing at least one chemical entity mention (Abstracts with SACEM), the number of annotated chemical entity mentions, the number of unique chemicals annotated (the non-redundant list of mentions), and the number of journals from which the annotated abstracts were drawn. The lower half of the table gives the number of mentions per CHEMDNER entity class (see Figure 1) for each set and for the entire corpus.
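As a quick consistency check on the overview table, the per-class counts in the lower half should add up to the "Nr. mentions" row in each column. The sketch below copies the figures from the table and also derives each class's share of the entire corpus; the variable names are illustrative only.

```python
# Counts copied from the corpus overview table, in the order
# (training, development, test, entire corpus).
SPLITS = ("training", "development", "test", "entire")
MENTIONS = (29478, 29526, 25351, 84355)
CLASS_COUNTS = {
    "TRIVIAL": (8832, 8970, 7808, 25610),
    "SYSTEMATIC": (6656, 6816, 5666, 19138),
    "ABBREVIATION": (4538, 4521, 4059, 13118),
    "FORMULA": (4448, 4137, 3443, 12028),
    "FAMILY": (4090, 4223, 3622, 11935),
    "IDENTIFIER": (672, 639, 513, 1824),
    "MULTIPLE": (202, 188, 199, 589),
    "NO CLASS": (40, 32, 41, 113),
}

# Sum the per-class rows column by column.
column_sums = tuple(sum(col) for col in zip(*CLASS_COUNTS.values()))

# Share of each class among all mentions of the entire corpus, in percent.
shares = {
    name: 100 * counts[-1] / MENTIONS[-1]
    for name, counts in CLASS_COUNTS.items()
}

for split, reported, summed in zip(SPLITS, MENTIONS, column_sums):
    print(f"{split}: reported {reported}, summed {summed}, match={reported == summed}")
for name, share in shares.items():
    print(f"{name}: {share:.2f}%")
```

With the figures above, every column sum matches the reported number of mentions, and TRIVIAL is the largest class at roughly 30% of all mentions. The "Nr. chemicals" row is not additive in the same way, since it counts unique mentions and the same chemical can occur in more than one set.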
Figure 3. Chemical entity frequency. (A) Zipf plot of all chemical entities in the CHEMDNER corpus. (B) Most frequent chemical mentions of the CHEMDNER corpus. Note: The annotation guidelines specified a small stop list of chemicals that were not annotated.
CHEMDNER abstracts, split into chemical disciplines (subject categories, first column; MULTIDISCIPL. CHEM.: Multidisciplinary Chemistry).
| Chem. subject categories | Abstracts | Mentions | AB | FA | FO | ID | MU | NO | SY | TR |
|---|---|---|---|---|---|---|---|---|---|---|
| PHARMACOLOGY | 1,983 | 23,368 | 18.81 | 10.54 | 6.42 | 4.93 | 0.64 | 0.29 | 17.28 | 41.09 |
| MEDICINAL CHEMISTRY | 1,957 | 17,543 | 10.00 | 21.11 | 8.00 | 2.10 | 1.56 | 0.12 | 25.88 | 31.23 |
| ORGANIC CHEMISTRY | 1,893 | 22,622 | 18.77 | 10.56 | 6.56 | 5.00 | 0.63 | 0.30 | 17.43 | 40.74 |
| TOXICOLOGY | 1,664 | 21,608 | 20.82 | 10.59 | 14.16 | 1.35 | 0.46 | 0.13 | 22.68 | 29.81 |
| MULTIDISCIPL. CHEM. | 1,217 | 11,892 | 14.38 | 12.15 | 27.97 | 0.52 | 0.55 | 0.13 | 25.62 | 18.67 |
| PHYSICAL CHEMISTRY | 997 | 9,682 | 12.14 | 9.81 | 36.39 | 0.27 | 0.43 | 0.15 | 27.57 | 13.24 |
| BIOCHEMISTRY | 879 | 6,503 | 18.75 | 16.55 | 14.24 | 1.12 | 0.34 | 0.11 | 23.17 | 25.73 |
| APPLIED CHEMISTRY | 843 | 7,759 | 8.48 | 24.45 | 7.71 | 0.17 | 1.37 | 0.10 | 24.99 | 32.74 |
| ENDOCRINOLOGY | 652 | 5,484 | 14.66 | 16.01 | 9.87 | 1.33 | 0.15 | 0.15 | 20.13 | 37.71 |
| POLYMER SCIENCE | 232 | 1,999 | 33.82 | 17.26 | 6.50 | 0.05 | 0.10 | 0.00 | 25.86 | 16.41 |
| CHEMICAL ENGINEERING | 3 | 42 | 0.00 | 0.00 | 38.10 | 0.00 | 0.00 | 0.00 | 61.90 | 0.00 |
Abstracts: The number of abstracts associated with that category in the CHEMDNER corpus. Mentions: The total number of chemical entity mentions in the abstracts of that category. Remaining columns: The values provided for the different SACEM classes correspond to the percentage of mentions in that category; AB: ABBREVIATION, FA: FAMILY, FO: FORMULA, ID: IDENTIFIER, MU: MULTIPLE, NO: NO CLASS, SY: SYSTEMATIC, TR: TRIVIAL.
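Because the class columns are percentages of each category's mentions, approximate absolute counts can be recovered by multiplying by the "Mentions" column. The sketch below does this for two rows copied from the table. The resulting counts are estimates only, since the published percentages are rounded to two decimals, and the helper name `estimated_counts` is hypothetical.

```python
# SACEM class abbreviations, in the column order of the table.
CLASSES = ("AB", "FA", "FO", "ID", "MU", "NO", "SY", "TR")

# Two rows copied from the table: (mentions, percentages per class).
ROWS = {
    "PHARMACOLOGY": (23368, (18.81, 10.54, 6.42, 4.93, 0.64, 0.29, 17.28, 41.09)),
    "PHYSICAL CHEMISTRY": (9682, (12.14, 9.81, 36.39, 0.27, 0.43, 0.15, 27.57, 13.24)),
}


def estimated_counts(mentions, percentages):
    """Convert per-class percentages back into approximate mention counts."""
    return {cls: round(mentions * pct / 100) for cls, pct in zip(CLASSES, percentages)}


estimates = {
    category: estimated_counts(mentions, percentages)
    for category, (mentions, percentages) in ROWS.items()
}

for category, counts in estimates.items():
    print(category, counts)
```

For these two rows the percentages sum to 100, and the estimates give about 9,602 TRIVIAL mentions in PHARMACOLOGY and about 3,523 FORMULA mentions in PHYSICAL CHEMISTRY, which illustrates how strongly the class mix differs between disciplines.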
Figure 4. The hallmarks of text corpus construction that were applied to the BioCreative CHEMDNER task.