Parisa Kordjamshidi, Dan Roth, Marie-Francine Moens.
Abstract
BACKGROUND: We aim to automatically extract species names of bacteria and their locations from webpages. This task is important for exploiting the vast amount of biological knowledge that is expressed in diverse natural language texts and for putting this knowledge into databases for easy access by biologists. The task is challenging, and previous results are far below an acceptable level of performance, particularly for the extraction of localization relationships. Therefore, we aim to design a new system for such extractions, using the framework of structured machine learning techniques.
Year: 2015 PMID: 25909637 PMCID: PMC4426185 DOI: 10.1186/s12859-015-0542-z
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. An input example structure represented as a document and its NLP features at different layers (document, paragraph, sentence, ...) independent from the output representation and its elements.
Figure 2. The output space of the BB-localization task.
Figure 3. The output space when placing BB-localization in the SpRL framework. Double line: the unknown elements relevant for our training/prediction model; red: the output elements in the original BioNLP task; blue: external information that can be used; black: output concepts of interest that are not annotated in the data.
Local and global features of various input components
| Component | Feature | Description |
|---|---|---|
| Word features | Lexical-form | Word surface that appears in the text |
| | Bio-lemma | Word lemma, using a lemmatizer for the biomedical domain that uses additional lexical resources [24] |
| | POS-tag | Part-of-speech tag of a word, to exploit syntactic information for training |
| | Dprl | Dependency relation of a word to its syntactic head, which gives clues to the semantic relationships |
| | Cocoa | Word tag using Cocoa, an external resource of biological concepts |
| | Capital | Whether a word starts with a capital letter |
| | Stop-word | Whether a word belongs to a list of stop words |
| Phrase features | Head-features | The features of the word that is the syntactic head of a phrase |
| | nHead-features | The features of the other words contained in the phrase |
| | Lexical-surface | Concatenation of the lexical forms of the words in the phrase |
| | Phrasal-POS | The phrasal part-of-speech tag: the parse tree tag of the common parent of the words in a phrase |
| | NCBI-sim | Comparison of the phrase with the list of bacterium names in NCBI |
| | Ontobio-sim | Comparison of the phrase with the habitat classes in OntoBiotope |
| Phrase-pair features | Same-par | Whether the two phrases occur in the same paragraph |
| | Same-sen | Whether the two phrases occur in the same sentence |
| | inTitle | Whether the bacterium candidate occurs in the title |
| | Verb | The verb between the two phrases, if they are in the same sentence |
| | Preposition | The preposition between the two phrases, if they are in the same sentence |
| | Parse-Dis | The distance between the two phrases in the parse tree |
| | Parse-Path | The path between the two phrases in the parse tree |
| | Heads-Lem | The concatenation of the lemmas of the two heads |
| | Heads-POS | The concatenation of the POS tags of the two heads |
| | Dep-Path | The dependency path between the two heads |
| Relation-pair features | Same-B | Whether two relations have exactly the same bacterium candidate |
| | Sim-BH | Similarity of two relations based on the similarity of their bacterium and habitat candidates |
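To make the phrase-pair features more concrete, the following is a minimal sketch of how the three simplest ones (Same-par, Same-sen, inTitle) could be computed for a bacterium/habitat candidate pair. The `Phrase` class, its fields, and the `pair_features` function are hypothetical names introduced for illustration only; they are not taken from the authors' system, and the example phrases are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phrase:
    """Hypothetical candidate phrase with its position in the document."""
    text: str
    paragraph: int          # index of the paragraph containing the phrase
    sentence: int           # document-wide index of the sentence containing the phrase
    in_title: bool = False  # True if the phrase occurs in the document title


def pair_features(bacterium: Phrase, habitat: Phrase) -> dict:
    """Compute three boolean phrase-pair features named in the table above."""
    return {
        "Same-par": bacterium.paragraph == habitat.paragraph,
        "Same-sen": bacterium.sentence == habitat.sentence,
        "inTitle": bacterium.in_title,
    }


# Made-up example: both phrases in paragraph 0, but in different sentences.
bacterium = Phrase("Bifidobacterium longum", paragraph=0, sentence=0, in_title=True)
habitat = Phrase("human gut", paragraph=0, sentence=1)
features = pair_features(bacterium, habitat)
print(features)
```

The remaining pair features (Verb, Preposition, Parse-Path, Dep-Path, and so on) would need a parser and are left out of this sketch.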
Local training/prediction vs. joint training and prediction over training/development sets; significant improvement made by the joint training models (IBT) on localization relationship (Loc) extraction
| | | | | | |
|---|---|---|---|---|---|
| Bac. | P | 0.959 | 0.959 | 0.991 | 0.972 |
| | R | 0.993 | 0.994 | 0.970 | 0.978 |
| | F | 0.976 | 0.976 | 0.980 | 0.975 |
| Hab. | P | 0.977 | 0.977 | 0.987 | 0.977 |
| | R | 0.964 | 0.964 | 0.923 | 0.975 |
| | F | 0.971 | 0.971 | 0.954 | 0.976 |
| Loc. | P | 0.188 | 0.20 | 0.311 | 0.318 |
| | R | 0.274 | 0.268 | 0.584 | 0.580 |
| | F | 0.223 | 0.229 | 0.406 | 0.411 |
IBT+I vs. task-3 participants (TEES and LIMSI) evaluated on the test set by the online system of the BioNLP-ST 2013 task; relations without gold entities; relaxed scores in parentheses
| System | P | R | F |
|---|---|---|---|
| TEES | 0.18 (0.61) | 0.12 (0.41) | 0.14 (0.49) |
| LIMSI | 0.12 (0.15) | 0.04 (0.08) | 0.06 (0.09) |
| IBT+IG1 (p) | 0.311 | 0.171 | 0.221 |
| IBT+I | 0.238 (0.596) | 0.279 (0.561) | 0.257 (0.578) |
| IBT+IG1 | 0.311 (0.594) | 0.241 (0.483) | 0.272 (0.533) |
| IBT+IG2 | 0.331 (0.588) | 0.224 (0.431) | 0.267 (0.498) |
| IBT+I (s) | 0.241 (0.515) | 0.436 (0.624) | 0.311 (0.564) |
| IBT+IG1 (s) | 0.305 (0.563) | 0.400 (0.640) | 0.346 (0.599) |
| IBT+IG2 (s) | 0.327 (0.555) | 0.367 (0.560) | 0.346 (0.558) |
(s) denotes sentence-level evaluation. (p) denotes that the strict evaluation is also penalized for missing PartOf relations.
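As a sanity check on how the scores relate, the sketch below recomputes the F-measure as the harmonic mean of precision and recall for a few strict-score rows of the table above. It assumes the three score columns are precision, recall, and F in that order; that ordering is an inference from the values, not something stated in this record. Small differences from the reported F values are expected because the inputs are already rounded.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F) copied from the strict scores in the table;
# the column order is an assumption.
rows = {
    "TEES": (0.18, 0.12, 0.14),
    "IBT+I": (0.238, 0.279, 0.257),
    "IBT+IG1": (0.311, 0.241, 0.272),
    "IBT+IG1 (s)": (0.305, 0.400, 0.346),
}

for name, (p, r, reported) in rows.items():
    print(f"{name}: reported F={reported:.3f}, recomputed F={f_measure(p, r):.3f}")
```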