Balakrishna Kolluru, Lezan Hawizy, Peter Murray-Rust, Junichi Tsujii, Sophia Ananiadou.
Abstract
Chemistry text mining tools should be interoperable and adaptable regardless of system-level implementation, installation or even programming issues. We aim to abstract the functionality of these tools from the underlying implementation via reconfigurable workflows for automatically identifying chemical names. To achieve this, we refactored an established named entity recogniser in the chemistry domain, OSCAR, and studied the impact of each component on the net performance. We developed two reconfigurable workflows from OSCAR using an interoperable text mining framework, U-Compare. These workflows can be altered using the drag-&-drop mechanism of the graphical user interface of U-Compare. They also provide a platform to study the relationship between text mining components such as tokenisation and named entity recognition (using maximum entropy Markov model (MEMM) and pattern recognition based classifiers). Results indicate that, for chemistry in particular, eliminating the noise generated by tokenisation leads to slightly better named entity recognition (NER) accuracy. Poor tokenisation translates into poorer input to the classifier components, which in turn leads to an increase in Type I or Type II errors, thus lowering the overall performance. On the Sciborg corpus, the workflow-based system, which uses a new tokeniser whilst retaining the same MEMM component, increases the F-score from 82.35% to 84.44%. On the PubMed corpus, it recorded an F-score of 84.84%, compared with 84.23% for OSCAR.
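The abstract's central claim is that tokenisation noise degrades NER input: a naive tokeniser fragments multi-part chemical names, so the classifier never sees them as single candidate entities. The following is a minimal illustrative sketch, not OSCAR's actual tokeniser; both regular expressions are hypothetical stand-ins for a naive and a chemistry-aware tokenisation rule.

```python
import re

text = "The sample contained 2,4-dinitrotoluene and ethyl acetate."

# Naive tokenisation: split on any non-alphanumeric boundary,
# fragmenting locant-bearing chemical names.
naive = re.findall(r"[A-Za-z]+|\d+", text)

# Chemistry-aware sketch: keep digit/comma locants and hyphens
# attached to the name so it survives as one token.
chem_aware = re.findall(r"\d[\d,]*-[A-Za-z-]+|[A-Za-z]+", text)

print(naive)       # '2', '4' and 'dinitrotoluene' are separate fragments
print(chem_aware)  # '2,4-dinitrotoluene' survives as one token
```

A downstream classifier receiving the naive stream must reassemble the name from three tokens before it can label it, which is the kind of error propagation the paper attributes to poor tokenisation.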
Year: 2011 PMID: 21633495 PMCID: PMC3102085 DOI: 10.1371/journal.pone.0020181
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. A normal workflow and a reconfigurable workflow, as can be built using U-Compare.
Figure 2. The original architecture of Oscar3.
Figure 3. Oscar3 refactored as a workflow of different components.
Figure 4. U-Compare view of the Oscar workflow.
The right side of the figure shows a workflow built from the Oscar components shown on the left.
Figure 5. ROC curves comparing the performance of various Oscar variants.
In all four experiments, the Oscar workflow has a slight edge over the Oscar3 variant.
Performance (%) of different variants of Oscar on Sciborg test data, using models trained on Sciborg data and PubMed data.

| Variants on Sciborg | Metric | Sciborg model | PubMed model |
|---|---|---|---|
| Oscar3 (MEMM) | P | 88.24 | 74.76 |
| | R | 77.19 | 65.18 |
| | F | 82.35 | 69.64 |
| Oscar workflow (MEMM) | P | 90.31 | 80.19 |
| | R | 79.29 | 71.22 |
| | F | 84.44 | 75.44 |
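The F-scores in these tables are the harmonic mean of precision and recall. A quick check against the table above (the `f1` helper is ours, not part of OSCAR):

```python
def f1(p, r):
    """F1 score: harmonic mean of precision p and recall r (both in percent)."""
    return 2 * p * r / (p + r)

# Oscar3 (MEMM) on Sciborg, Sciborg-trained model:
print(round(f1(88.24, 77.19), 2))  # 82.35, as reported
# Oscar workflow (MEMM), same setting:
print(round(f1(90.31, 79.29), 2))  # 84.44, as reported
```

The reported gain of roughly two F-score points thus reflects simultaneous improvements in both precision (88.24 to 90.31) and recall (77.19 to 79.29).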
Performance (%) of different Oscar pattern recogniser versions on Sciborg.

| Variants on Sciborg | Metric | Score (%) |
|---|---|---|
| Oscar3 (PAT) | P | 70.43 |
| | R | 67.42 |
| | F | 68.89 |
| Oscar workflow (PAT) | P | 74.11 |
| | R | 73.68 |
| | F | 73.90 |
Performance (%) of different variants of Oscar on PubMed test data, using models trained on Sciborg data and PubMed data.

| Variants on PubMed | Metric | Sciborg model | PubMed model |
|---|---|---|---|
| Oscar3 (MEMM) | P | 75.28 | 89.04 |
| | R | 63.42 | 79.91 |
| | F | 68.84 | 84.23 |
| Oscar workflow (MEMM) | P | 75.06 | 85.66 |
| | R | 64.58 | 84.03 |
| | F | 69.43 | 84.84 |
Performance (%) of different Oscar pattern recogniser versions on PubMed.

| Variants on PubMed | Metric | Score (%) |
|---|---|---|
| Oscar3 (PAT) | P | 44.22 |
| | R | 58.24 |
| | F | 50.27 |
| Oscar workflow (PAT) | P | 45.64 |
| | R | 60.35 |
| | F | 51.97 |
Figure 6. U-Compare output for a test document.
Chemical names (underlined) as identified by the MEMM-based workflow.