Ewoud Pons, Benedikt F H Becker, Saber A Akhondi, Zubair Afzal, Erik M van Mulligen, Jan A Kors.
Abstract
We describe our approach to the chemical-disease relation (CDR) task in the BioCreative V challenge. The CDR task consists of two subtasks: automatic disease-named entity recognition and normalization (DNER), and extraction of chemical-induced diseases (CIDs) from Medline abstracts. For the DNER subtask, we used our concept recognition tool Peregrine, in combination with several optimization steps. For the CID subtask, our system, which we named RELigator, was trained on a rich feature set, comprising features derived from a graph database containing prior knowledge about chemicals and diseases, and linguistic and statistical features derived from the abstracts in the CDR training corpus. We describe the systems that were developed and present evaluation results for both subtasks on the CDR test set. For DNER, our Peregrine system reached an F-score of 0.757. For CID, the system achieved an F-score of 0.526, which ranked second among 18 participating teams. Several post-challenge modifications of the systems resulted in substantially improved F-scores (0.828 for DNER and 0.602 for CID). RELigator is available as a web service at http://biosemantics.org/index.php/software/religator.
Year: 2016 PMID: 27081155 PMCID: PMC4831722 DOI: 10.1093/database/baw046
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. Workflow for CDR extraction. The chemical and disease entities in a Medline abstract are recognized and mapped to their corresponding MeSH identifiers by tmChem (for chemicals) and Peregrine (for diseases). For each possible combination of chemicals and diseases found in the document, features are generated from prior knowledge in a knowledge platform and from statistical and linguistic information in the document. The features are fed to an SVM classifier to detect CIDs.
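The candidate-generation and thresholding steps of this workflow can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the RELigator implementation: the function names, the placeholder identifiers, the toy scores, and the choice of `>=` for the threshold comparison are all hypothetical, and the scoring function stands in for the trained SVM.

```python
from itertools import product

def candidate_pairs(chemical_ids, disease_ids):
    """Return every distinct (chemical, disease) combination in one abstract."""
    return list(product(sorted(set(chemical_ids)), sorted(set(disease_ids))))

def extract_cids(chemical_ids, disease_ids, score_pair, threshold):
    """Keep the candidate pairs whose classifier score reaches the threshold.

    score_pair is a stand-in for the trained classifier: it takes a chemical
    identifier and a disease identifier and returns a probability.
    """
    return [
        pair
        for pair in candidate_pairs(chemical_ids, disease_ids)
        if score_pair(*pair) >= threshold
    ]

# Placeholder identifiers and scores, invented for illustration only.
chemicals = ["CHEM_A", "CHEM_A"]          # the same chemical mentioned twice
diseases = ["DIS_X", "DIS_Y"]
toy_scores = {("CHEM_A", "DIS_X"): 0.81, ("CHEM_A", "DIS_Y"): 0.12}

relations = extract_cids(
    chemicals, diseases, lambda c, d: toy_scores[(c, d)], threshold=0.30
)
print(relations)
```

Deduplicating identifiers before pairing reflects that relations are scored per chemical-disease combination in the document rather than per mention.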
Characteristics of the CDR corpus
| Data | Training | Development | Test | Total |
|---|---|---|---|---|
| Abstracts | 500 | 500 | 500 | 1500 |
| Chemical mentions | 5203 | 5347 | 5385 | 15 935 |
| Unique chemical identifiers | 1467 | 1507 | 1435 | 4409 |
| Disease mentions | 4182 | 4244 | 4424 | 12 850 |
| Unique disease identifiers | 1965 | 1865 | 1988 | 5718 |
| CDRs | 1038 | 1012 | 1066 | 3116 |
Figure 2. Example dependency parse tree for a sentence about the chemical ‘acetaminophen’ and the disease ‘anaphylaxis’. The governing verb of the disease is ‘produce’; the governing verb of the chemical is ‘demonstrated’, which is also the relating word.
Performance of the Peregrine challenge and post-challenge systems for disease normalization on the test set
| System | Recall | Precision | F-score |
|---|---|---|---|
| Peregrine, challenge | 0.772 | 0.737 | 0.757 |
| Peregrine, post challenge | 0.839 | 0.818 | 0.828 |
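For readers who want to relate the three columns, the following sketch assumes the reported F-score is the balanced F-score, that is, the harmonic mean of recall and precision. The function name `f_score` is hypothetical, and the input values are the post-challenge row of the table above.

```python
def f_score(recall, precision):
    """Balanced F-score: the harmonic mean of recall and precision."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

# Post-challenge Peregrine figures from the table above.
post_challenge = f_score(recall=0.839, precision=0.818)
print(round(post_challenge, 3))  # 0.828
```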
Error analysis of 50 false-positive and 50 false-negative errors of the post-challenge Peregrine system
| Error type | False-positive | False-negative |
|---|---|---|
| Term mapped to incorrect MeSH identifier | 8 | 6 |
| Term incorrectly on exclusion list | - | 5 |
| Term partially recognized | 13 | 15 |
| Term incorrectly recognized | 23 | - |
| Term not recognized | - | 20 |
| Annotation error | 6 | 4 |
Performance of different relation extraction systems on the CDR training and development data, given perfect entity annotations
| System | Threshold | Recall | Precision | F-score |
|---|---|---|---|---|
| Co-occurrence at sentence level | n/a | 0.725 | 0.313 | 0.437 |
| Knowledge base | n/a | 0.664 | 0.405 | 0.503 |
| SVM, all challenge features | 0.30 | 0.840 | 0.693 | 0.760 |
| SVM, all post-challenge features | 0.34 | 0.854 | 0.753 | 0.801 |
| SVM, post-challenge features without prior knowledge features | 0.33 | 0.765 | 0.695 | 0.728 |
| SVM, post-challenge features without statistical features | 0.39 | 0.775 | 0.683 | 0.726 |
| SVM, post-challenge features without linguistic features | 0.38 | 0.842 | 0.701 | 0.765 |
Threshold: probability threshold used by the SVM to decide whether there is a relationship.
Performance of relation extraction systems on the CDR test data, for different entity annotations
| System | Entity annotation | Threshold | Recall | Precision | F-score |
|---|---|---|---|---|---|
| SVM, all challenge features | tmChem, Peregrine challenge | 0.20 | 0.601 | 0.540 | 0.569 |
| SVM, all challenge features | tmChem, Peregrine challenge | 0.30 | 0.537 | 0.579 | 0.557 |
| SVM, all challenge features | tmChem, Peregrine challenge | 0.40 | 0.467 | 0.605 | 0.527 |
| SVM, all challenge features | tmChem, Peregrine post-challenge | 0.30 | 0.556 | 0.569 | 0.563 |
| SVM, all post-challenge features | tmChem, Peregrine post-challenge | 0.34 | 0.570 | 0.637 | 0.602 |
| SVM, all post-challenge features | Gold standard | 0.34 | 0.731 | 0.676 | 0.702 |
Threshold: probability threshold used by the SVM to decide whether there is a relationship.
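The first three rows of the table above show recall falling and precision rising as the threshold moves from 0.20 to 0.40. A minimal sketch of that trade-off follows. The pair names, scores, and gold set are invented toy data, the function `evaluate` is hypothetical, and treating a score equal to the threshold as a positive is an assumption.

```python
def evaluate(scored_pairs, gold, threshold):
    """Return (recall, precision) for pairs scored at or above the threshold."""
    predicted = {pair for pair, score in scored_pairs.items() if score >= threshold}
    true_positives = len(predicted & gold)
    recall = true_positives / len(gold) if gold else 0.0
    precision = true_positives / len(predicted) if predicted else 0.0
    return recall, precision

# Toy classifier probabilities and gold-standard relations (illustrative only).
scored = {
    ("c1", "d1"): 0.90,
    ("c1", "d2"): 0.35,
    ("c2", "d1"): 0.25,
    ("c2", "d2"): 0.10,
}
gold = {("c1", "d1"), ("c2", "d1")}

for threshold in (0.20, 0.30, 0.40):
    recall, precision = evaluate(scored, gold, threshold)
    print(f"threshold={threshold:.2f} recall={recall:.3f} precision={precision:.3f}")
```

On this toy data a higher threshold discards the low-scoring true relation first, which lowers recall, and then the false one, which raises precision.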