Jiao Li1, Yueping Sun1, Robin J Johnson2, Daniela Sciaky2, Chih-Hsuan Wei3, Robert Leaman3, Allan Peter Davis2, Carolyn J Mattingly2, Thomas C Wiegers2, Zhiyong Lu4.
Abstract
Community-run, formal evaluations and manually annotated text corpora are critically important for advancing biomedical text-mining research. Recently in BioCreative V, a new challenge was organized for the tasks of disease named entity recognition (DNER) and chemical-induced disease (CID) relation extraction. Given the nature of both tasks, a test collection is required to contain both disease/chemical annotations and relation annotations in the same set of articles. Despite previous efforts in biomedical corpus construction, none was found to be sufficient for the task. Thus, we developed our own corpus called BC5CDR during the challenge by inviting a team of Medical Subject Headings (MeSH) indexers for disease/chemical entity annotation and Comparative Toxicogenomics Database (CTD) curators for CID relation annotation. To ensure high annotation quality and productivity, detailed annotation guidelines and automatic annotation tools were provided. The resulting BC5CDR corpus consists of 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases and 3116 chemical-disease interactions. Each entity annotation includes both the mention text spans and normalized concept identifiers, using MeSH as the controlled vocabulary. To ensure accuracy, the entities were first captured independently by two annotators followed by a consensus annotation: The average inter-annotator agreement (IAA) scores were 87.49% and 96.05% for the disease and chemicals, respectively, in the test set according to the Jaccard similarity coefficient. Our corpus was successfully used for the BioCreative V challenge tasks and should serve as a valuable resource for the text-mining research community.
Database URL: http://www.biocreative.org/tasks/biocreative-v/track-3-cdr/
Published by Oxford University Press 2016. This work is written by US Government employees and is in the public domain in the United States.
Year: 2016 PMID: 27161011 PMCID: PMC4860626 DOI: 10.1093/database/baw068
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Comparison with the previous chemical disease relation corpora
| Corpus | Annotation scope | Size | Entity annotation—Mention | Entity annotation—Concept | Relation annotation |
|---|---|---|---|---|---|
| BC5CDR | Abstract | 1500 | Yes | Yes | Yes |
| EU-ADR (16) | Sentence | 300 | Yes | Yes | Yes |
| ADE (17) | Sentence | 2972 | Yes | No | Yes |
| Corpus (18) | Abstract | 400 | Yes | Yes | No |
Figure 1. Annotation example shown in our annotation tool, PubTator.
Figure 2. PubTator format annotation (PMID: 354896).
Figure 3. BioC format annotation (PMID: 354896).
The overall corpus statistics
| Task dataset | Articles | Disease mentions | Disease IDs | Chemical mentions | Chemical IDs | CID relations |
|---|---|---|---|---|---|---|
| Training | 500 | 4182 | 1965 | 5203 | 1467 | 1038 |
| Development | 500 | 4244 | 1865 | 5347 | 1507 | 1012 |
| Test | 500 | 4424 | 1988 | 5385 | 1435 | 1066 |
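The corpus-wide totals quoted in the abstract can be checked against this table by summing each column over the three sets. The sketch below does that arithmetic; the dictionary keys are names chosen here for illustration and are not part of the corpus distribution. Note that the abstract's figures (4409 chemicals, 5818 diseases, 3116 interactions) appear to correspond to the ID and CID relation columns, not to the mention columns.

```python
# Per-set counts copied from the corpus statistics table.
# Key names are illustrative assumptions, not official field names.
stats = {
    "Training": {
        "articles": 500, "disease_mentions": 4182, "disease_ids": 1965,
        "chemical_mentions": 5203, "chemical_ids": 1467, "cid_relations": 1038,
    },
    "Development": {
        "articles": 500, "disease_mentions": 4244, "disease_ids": 1865,
        "chemical_mentions": 5347, "chemical_ids": 1507, "cid_relations": 1012,
    },
    "Test": {
        "articles": 500, "disease_mentions": 4424, "disease_ids": 1988,
        "chemical_mentions": 5385, "chemical_ids": 1435, "cid_relations": 1066,
    },
}

# Sum every column across the training, development and test sets.
columns = list(stats["Training"].keys())
totals = {col: sum(row[col] for row in stats.values()) for col in columns}

for col, value in totals.items():
    print(f"{col}: {value}")
```

Summing this way gives 1500 articles, 5818 disease IDs, 4409 chemical IDs and 3116 CID relations, which matches the abstract.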
Inter-annotator agreement (IAA) scores of the three sets
| Task dataset | Disease | Chemical |
|---|---|---|
| Training | 0.8600 | 0.9523 |
| Development | 0.8742 | 0.9577 |
| Test | 0.8875 | 0.9630 |
| All | 0.8749 | 0.9605 |
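The scores above are reported as Jaccard similarity coefficients, that is, the size of the intersection of two annotators' annotation sets divided by the size of their union. The following is a minimal sketch of that calculation, assuming each annotation is represented as a (start, end, concept ID) tuple. The exact unit of comparison used for the corpus is not stated in this record, and the sample annotations are hypothetical.

```python
def jaccard(a, b):
    """Return |A ∩ B| / |A ∪ B| for two collections of annotations."""
    a, b = set(a), set(b)
    if not a and not b:
        # Assumption: two empty annotation sets count as full agreement.
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical annotations as (start offset, end offset, concept ID).
annotator_1 = {(0, 8, "D009270"), (30, 41, "D007022"), (60, 72, "D003000")}
annotator_2 = {(0, 8, "D009270"), (30, 41, "D007022"), (80, 90, "D006973")}

# Two annotations are shared; the union holds four distinct annotations.
score = jaccard(annotator_1, annotator_2)
print(f"{score:.4f}")  # 2 / 4 = 0.5000
```

Under this representation, a mention counts as agreed only when span and concept identifier both match; a looser criterion, such as matching on concept ID alone, would yield higher scores.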
Figure 4. Disagreements of disease and chemical annotations.
Figure 5. Distribution of disease mentions in the corpus.
Figure 6. Distribution of chemical mentions in the corpus.
Figure 7. Distribution of chemical-induced disease relations in the corpus.