Gurusamy Murugesan, Sabenabanu Abdulkadhar, Balu Bhasuran, Jeyakumar Natarajan.
Abstract
Tagging biomedical entities such as genes, proteins, cells, and cell lines is the first step and an important prerequisite in biomedical literature mining. In this paper, we describe our hybrid named entity tagging approach, BCC-NER (bidirectional, contextual clues named entity tagger for gene/protein mention recognition). BCC-NER comprises three modules. The first module handles text processing, which includes basic NLP pre-processing, feature extraction, and feature selection. The second module handles training and model building: bidirectional conditional random fields (CRF) parse the text in both directions (forward and backward), and the forward and backward trained models are integrated using the margin-infused relaxed algorithm (MIRA). The third and final module performs post-processing to improve performance, which includes surrounding text features, parenthesis mismatching, and a two-tier abbreviation algorithm. On the BioCreative II GM test corpus, BCC-NER achieves a precision of 89.95, a recall of 84.15, and an overall F-score of 86.95, which is higher than that of the other currently available open source taggers.
Keywords: Bidirectional parsing; Biomedical text mining; Conditional random fields; Hybrid NER approaches; Margin-infused relaxed algorithm; Named entity recognition
Year: 2017 PMID: 28477208 PMCID: PMC5419958 DOI: 10.1186/s13637-017-0060-6
Source DB: PubMed Journal: EURASIP J Bioinform Syst Biol ISSN: 1687-4145
Fig. 1 The workflow of various modules in BCC-NER
Examples of orthographic, morphologic, and prefix-suffix features
| Feature | Example | Feature | Example |
|---|---|---|---|
| INITCAPS | Albumin | HAS_QUOTE | gstC’ mutans |
| ALLCAPS | SGPT | HAS_SLASH | P42/44 |
| ENDCAPS | IgA | END_PLUS | HexA+ |
| UPPER-LOWER | Serum ACTH | END_QUOTE | C’ |
| TWOCAPS | LH | HASDASH | Ap-2 |
| THREECAPS | HMG | INITDASH | -beta |
| MORECAPS | GGTP | ENDDASH | CD45- |
| MIXEDCAPS | EcoRI | 2PREFIX | Fi(fibrin) |
| LOWERCASE | Calcitonin | 3PREFIX | Fib(fibrin) |
| ENDDIGIT | cna1 | 4PREFIX | Fibr(fibrin) |
| ALPHANUMERIC | p53 | 2SUFFIX | in(fibrin) |
| SINGLECHAR | R | 3SUFFIX | rin(fibrin) |
| NUMBERS_LETTERS | UR2 | 4SUFFIX | brin(fibrin) |
| HASDIGIT | E6 | HASGREEK | TNF-alpha |
| GREEK | Alpha | HASROMAN | factor II |
| ROMAN | I,II,IV | PUNCTUATION | (,)., |
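A subset of the features in the table above could be computed per token roughly as follows. This is a minimal sketch, not the authors' implementation: the function name `orthographic_features` and the exact rule behind each feature name are assumptions inferred from the examples in the table.

```python
def orthographic_features(token: str) -> dict:
    """Return a hypothetical subset of the orthographic and affix features.

    The rules are guesses based on the table's examples, not the
    definitions used in BCC-NER.
    """
    feats = {
        "INITCAPS": token[:1].isupper(),                 # e.g. Albumin
        "ALLCAPS": token.isalpha() and token.isupper(),  # e.g. SGPT
        "ENDCAPS": token[-1:].isupper(),                 # e.g. IgA
        "SINGLECHAR": len(token) == 1,                   # e.g. R
        "HASDIGIT": any(c.isdigit() for c in token),     # e.g. E6
        "ENDDIGIT": token[-1:].isdigit(),                # e.g. cna1
        "ALPHANUMERIC": any(c.isalpha() for c in token)
        and any(c.isdigit() for c in token),             # e.g. p53
        "HASDASH": "-" in token,                         # e.g. Ap-2
        "INITDASH": token.startswith("-"),               # e.g. -beta
        "ENDDASH": token.endswith("-"),                  # e.g. CD45-
        "HAS_SLASH": "/" in token,                       # e.g. P42/44
        "END_PLUS": token.endswith("+"),                 # e.g. HexA+
    }
    # Prefix and suffix features of length 2 to 4, emitted only when
    # the token is long enough to have them.
    for n in (2, 3, 4):
        if len(token) >= n:
            feats[f"{n}PREFIX"] = token[:n]
            feats[f"{n}SUFFIX"] = token[-n:]
    return feats


if __name__ == "__main__":
    print(orthographic_features("fibrin"))
```

In a CRF setting, a dictionary like this would typically be produced for every token and passed to the learner alongside contextual features; how BCC-NER encodes and selects its features is not specified here.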
System performance on various models
| Learning model | Precision | Recall | F-measure |
|---|---|---|---|
| CRF + forward parsing + post-processing | 89.18 | 83.45 | 86.21 |
| CRF + backward parsing + post-processing | 89.38 | 83.55 | 86.36 |
| CRF + union (forward + backward) + post-processing | 89.58 | 83.65 | 86.51 |
| CRF + combined model MIRA + post-processing | 89.95 | 84.15 | 86.95 |
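The relationship between the three columns can be illustrated with a short calculation. This sketch assumes the reported F-measure is the balanced F1 score, the harmonic mean of precision and recall; the function name `f_measure` is hypothetical, and the check is shown only for the final row of the table.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F1: harmonic mean of precision and recall (assumed metric)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # CRF + combined model MIRA + post-processing row
    print(round(f_measure(89.95, 84.15), 2))  # prints 86.95
```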
Comparison of our system with other open source systems
| System | Precision | Recall | F-measure |
|---|---|---|---|
| LingPipe | 72.95 | 88.49 | 79.97 |
| ABNER | 86.93 | 51.49 | 64.88 |
| BANNER | 88.66 | 84.32 | 86.43 |
| BCC-NER | 89.95 | 84.15 | 86.95 |