Pontus Stenetorp, Sampo Pyysalo, Sophia Ananiadou, Jun'ichi Tsujii.
Abstract
BACKGROUND: Semantic Category Disambiguation (SCD) is the task of assigning the appropriate semantic category to given spans of text from a fixed set of candidate categories, for example Protein to "Fibrin". SCD is relevant to Natural Language Processing tasks such as Named Entity Recognition, coreference resolution and coordination resolution. In this work, we study machine learning-based SCD methods using large lexical resources and approximate string matching, aiming to generalise these methods with regard to domains, lexical resources and the composition of data sets. We specifically consider the applicability of SCD for the purposes of supporting human annotators and acting as a pipeline component for other Natural Language Processing systems.
Keywords: Approximate string matching; Domain adaptation; Freebase; Lexical resources; Named entity recognition; Semantic category disambiguation
Year: 2014 PMID: 25093067 PMCID: PMC4107982 DOI: 10.1186/2041-1480-5-26
Source DB: PubMed Journal: J Biomed Semantics
Figure 1: Example of the prerequisite for our task setting, demarcated continuous spans as seen in (a), and the output, semantic categories assigned to the input spans as seen in (b). “2-comp-sys”, “Pro” and “+Regulation” are used as shorthand for “Two-component system”, “Protein” and “Positive regulation” respectively. Note the potential for partial overlap of different semantic categories, as can be seen for the “Protein” and “Two-component system” annotations.
Figure 2: Examples of entity type annotations from [25], illustrating how the amount of visual and user-interface complexity (a) can be reduced using an SCD system (b). The relevant text span being annotated in both figures is “heart”, which should be assigned the ORGAN semantic category.
Corpora used for evaluation
| Corpus | Semantic categories |
|---|---|
| Epigenetics and Post-Translational Modifications corpus | 17 |
| Infectious Diseases corpus | 16 |
| Genia Event corpus | 11 |
| Collaborative Annotation of a Large Biomedical Corpus | 4 |
| BioNLP/NLPBA 2004 Shared Task corpus | 5 |
| Gene Regulation Event Corpus | 64 (6) |
| Multi-Level Event Extraction corpus | 52 |
| GeneReg corpus | 10 |
| Gene Expression Text Miner corpus | 3 |
| BioInfer | 119 (97) |
| BioText | 2 |
| CoNLL-2002 Shared Task corpus, Spanish subset | 4 |
| CoNLL-2002 Shared Task corpus, Dutch subset | 4 |
| i2b2 Medication Challenge corpus | 6 |
| OSIRIS corpus | 2 |
Parenthesised values signify the actual number of categories after pre-processing steps applied either to avoid data sparseness (GREC conversion into SGREC [3]) or to compensate for ontological design decisions (BI). The mid-line indicates a cut-off between the above corpora used in previous work [3] and the corpora added to evaluate our approach for a variety of domains and covering a large set of semantic categories.
Figure 3: Example of sub-string components used to generate the NP-based features.
Figure 4: Learning curves for ambiguity (a) and recall (b) for our initial ambiguity experiments.
Results for the BT, GETM, I2B2 and OSIRIS data sets using the Int.NP.Sim. model with a confidence threshold of 95%: mean ambiguity reduction, mean recall, and the harmonic mean of mean ambiguity reduction and recall
| Mean ambiguity reduction | Mean recall | Harmonic mean |
|---|---|---|
| 78.00/+34.00 | 99.54/-00.31 | 87.46/+26.38 |
| 88.50/+32.50 | 99.99/-00.01 | 93.89/+22.10 |
| 77.60/+42.60 | 98.14/-01.50 | 86.67/+34.87 |
| 78.00/+42.00 | 99.79/-00.21 | 87.56/+34.62 |
The relative values are compared to the same model using a confidence threshold of 99.5%.
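The third value in each row can be checked against the first two. The sketch below is a minimal illustration, not the authors' evaluation code: the function name `harmonic_mean` and the `rows` list are hypothetical, and the input pairs are simply the absolute percentages (the part before the slash) copied from the table rows in the order they appear.

```python
def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two non-negative scores; returns 0.0 when both are zero."""
    if a + b == 0:
        return 0.0
    return 2 * a * b / (a + b)


# (mean ambiguity reduction, mean recall) in percent, copied from the table rows.
# Assumption: row order is kept as in the table; data set labels are not attached.
rows = [
    (78.00, 99.54),
    (88.50, 99.99),
    (77.60, 98.14),
    (78.00, 99.79),
]

for amb_red, recall in rows:
    print(f"{amb_red:.2f}  {recall:.2f}  {harmonic_mean(amb_red, recall):.2f}")
```

Each computed harmonic mean agrees with the corresponding table entry to within rounding. Because the harmonic mean is dominated by the smaller of its two arguments, it stays close to the ambiguity reduction figure even though recall is above 98% in every row.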