GuoDong Zhou, Dan Shen, Jie Zhang, Jian Su, SoonHeng Tan.
Abstract
This paper proposes an ensemble of classifiers for biomedical name recognition in which three classifiers, one Support Vector Machine and two discriminative Hidden Markov Models, are combined effectively using a simple majority voting strategy. In addition, we incorporate three post-processing modules, including an abbreviation resolution module, a protein/gene name refinement module and a simple dictionary matching module, into the system to further improve the performance. Evaluation shows that our system achieves the best performance from among 10 systems with a balanced F-measure of 82.58 on the closed evaluation of the BioCreative protein/gene name recognition task (Task 1A).
Year: 2005 PMID: 15960841 PMCID: PMC1869021 DOI: 10.1186/1471-2105-6-S1-S7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Orthographic Feature
| Features 1–11 | Features 12–21 |
| Comma | OneCap |
| Dot | AllCaps |
| Parenthesis | CapLowAlpha |
| RomanDigit | CapMixAlpha |
| GreekLetter | LowMixAlpha |
| StopWord | AlphaDigitAlpha |
| ATCGsequence | AlphaDigit |
| OneDigit | DigitAlphaDigit |
| AllDigits | DigitAlpha |
| DigitCommaDigit | Others |
| DigitDotDigit | |
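The table lists only the feature names, so the following is a minimal sketch of how a token might be mapped to one of them. The function name `orthographic_feature`, every regular expression, and the order in which the patterns are tried are assumptions made for illustration; they are not the paper's definitions. Features that would need word lists or rules not given here (RomanDigit, GreekLetter, StopWord, CapMixAlpha, LowMixAlpha) are left out.

```python
import re

# Feature names come from the table above. Every regex and the matching
# order are assumptions for illustration, not the paper's definitions.
PATTERNS = [
    ("Comma", r","),
    ("Dot", r"\."),
    ("Parenthesis", r"[()\[\]]"),
    ("DigitCommaDigit", r"\d+,\d+"),
    ("DigitDotDigit", r"\d+\.\d+"),
    ("OneDigit", r"\d"),
    ("AllDigits", r"\d+"),
    ("ATCGsequence", r"[ATCG]{4,}"),  # minimum length 4 is an arbitrary choice
    ("OneCap", r"[A-Z]"),
    ("AllCaps", r"[A-Z]+"),
    ("CapLowAlpha", r"[A-Z][a-z]+"),
    ("AlphaDigitAlpha", r"[A-Za-z]+\d+[A-Za-z]+"),
    ("AlphaDigit", r"[A-Za-z]+\d+"),
    ("DigitAlphaDigit", r"\d+[A-Za-z]+\d+"),
    ("DigitAlpha", r"\d+[A-Za-z]+"),
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in PATTERNS]


def orthographic_feature(token: str) -> str:
    """Return the first feature whose pattern matches the whole token."""
    for name, pattern in COMPILED:
        if pattern.fullmatch(token):
            return name
    return "Others"  # catch-all, as in the table


if __name__ == "__main__":
    for tok in ["p53", "IL2R", "3.5", "Kinase", "NF-kB"]:
        print(tok, orthographic_feature(tok))
```

Because the first matching pattern wins, more specific patterns (for example ATCGsequence) are placed before more general ones (AllCaps) in this sketch; a different ordering would assign some tokens to different features.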
Detailed performance of various components in our best closed system (closed-3)
| Configuration | P | R | F |
| SVM (individual) | 75.1 | 70.2 | 72.7 |
| DHMM1 (individual) | 71.6 | 71.9 | 71.8 |
| DHMM2 (individual) | 70.1 | 74.3 | 72.1 |
| Ensemble (majority voting) | 75.9 | 77.0 | 76.4 |
| Ensemble + Abbreviation Resolution | 79.8 | 80.4 | 80.1 |
| Ensemble + Name Refinement | 78.6 | 79.1 | 78.8 |
| Ensemble + Dictionary Matching | 75.5 | 78.5 | 76.9 |
| All (overall performance) | 82.0 | 83.2 | 82.6 |
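The "Ensemble (majority voting)" row combines the outputs of the three individual taggers. As a rough illustration of simple majority voting, the sketch below votes token by token over tag sequences. The BIO-style tags, the function name `majority_vote`, and the tie-breaking rule (defer to one designated classifier when no tag has a strict majority) are assumptions; the abstract does not specify these details.

```python
from collections import Counter


def majority_vote(predictions, fallback=0):
    """Combine per-token tag sequences from several classifiers.

    predictions: list of tag sequences, one per classifier, all of the
                 same length (one tag per token).
    fallback:    index of the classifier to trust when no tag wins a
                 strict majority (an assumed tie-breaking rule).
    """
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all classifiers must tag the same tokens")
    voted = []
    for tags in zip(*predictions):
        tag, count = Counter(tags).most_common(1)[0]
        # Keep the most common tag only if more than half the voters agree.
        voted.append(tag if count > len(tags) / 2 else tags[fallback])
    return voted


if __name__ == "__main__":
    # Hypothetical outputs for a four-token sentence.
    svm = ["B", "I", "O", "O"]
    dhmm1 = ["B", "O", "O", "B"]
    dhmm2 = ["B", "I", "O", "O"]
    print(majority_vote([svm, dhmm1, dhmm2]))
```

Voting on each token independently can produce tag sequences that no single classifier proposed, so a real system would probably need a consistency check on the merged output; that step is not shown here.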
Performance and configurations of all the evaluations in the protein/gene name recognition task
| Modules | Closed-1 | Closed-2 | Closed-3 | Open-1 |
| SVM | Surface word, orthographic feature, morphological pattern, trigger word | |||
| GENIA-POS | Refined-BioCreative-POS | Refined-BioCreative-POS | ||
| DHMM1 | Surface word, orthographic feature | |||
| GENIA-POS | Refined-BioCreative-POS | Refined-BioCreative-POS | |
| DHMM2 | Surface word, orthographic feature, BioCreative-POS | |||
| Ensemble | Majority Voting | |||
| Abbreviation Resolution | Abbreviation Resolution based on the parentheses structure | |||
| Name Refinement | N/A | N/A | N/A | |
| Dictionary Matching | Closed Dictionary | Closed Dictionary | Closed Dictionary | |
| Overall Performance | P 79.97 | P 80.46 | P 82.00 | P 75.10 |
| | R 80.50 | R 80.80 | R 83.17 | R 81.26 |
| | F 80.23 | F 80.63 | F 82.58 | F 78.06 |
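The balanced F-measure reported in the Overall Performance rows is the harmonic mean of precision and recall, F = 2PR / (P + R). The short sketch below recomputes it from the P and R values of each run as a consistency check; the function name `f_measure` and the optional `beta` parameter are illustrative choices, not part of the paper.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-measure; beta = 1 gives the balanced (harmonic mean) form."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # (P, R) pairs taken from the Overall Performance rows of the table.
    runs = {
        "Closed-1": (79.97, 80.50),
        "Closed-2": (80.46, 80.80),
        "Closed-3": (82.00, 83.17),
        "Open-1": (75.10, 81.26),
    }
    for name, (p, r) in runs.items():
        print(name, round(f_measure(p, r), 2))
```

Rounded to two decimals, the recomputed values agree with the F row of the table for all four runs.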
Contributions of various features in our best closed system (closed-3): decrease in precision/recall/F-measure when one feature is left out at a time.
| Feature | P | R | F |
| Orthographic Feature | 27.1 | 42.5 | 33.2 |
| POS | 23.7 | 31.1 | 26.8 |
| Surface Word | 12.2 | 8.1 | 10.1 |
| Trigger Word | 2.4 | 1.6 | 1.9 |
| Morphological Pattern | 1.3 | 1.0 | 1.1 |
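The decreases in this table come from a leave-one-out comparison: the system is scored with all features, then rescored with each feature removed in turn. The sketch below shows the shape of that loop. The `evaluate` callback stands in for training and scoring a full system, and the toy scorer with its weights is entirely made up so that the example runs; none of it reproduces the paper's numbers.

```python
def leave_one_out_drops(features, evaluate):
    """Return {feature: full-system score minus score without that feature}."""
    full = frozenset(features)
    baseline = evaluate(full)
    return {f: baseline - evaluate(full - {f}) for f in features}


# Made-up stand-in for "train the system on these features and score it".
# The weights are arbitrary and do not correspond to any reported result.
TOY_WEIGHTS = {"orthographic": 3.0, "pos": 2.0, "surface_word": 1.0}


def toy_evaluate(active_features):
    return sum(TOY_WEIGHTS[f] for f in active_features)


if __name__ == "__main__":
    drops = leave_one_out_drops(TOY_WEIGHTS, toy_evaluate)
    # List features from largest to smallest drop, as in the table.
    for feature, drop in sorted(drops.items(), key=lambda kv: -kv[1]):
        print(feature, drop)
```

In a real ablation the drops need not add up to the total score, since features can overlap in the information they carry; the additive toy scorer hides that effect.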