
NERBio: using selected word conjunctions, term normalization, and global patterns to improve biomedical named entity recognition.

Richard Tzong-Han Tsai, Cheng-Lung Sung, Hong-Jie Dai, Hsieh-Chuan Hung, Ting-Yi Sung, Wen-Lian Hsu.

Abstract

BACKGROUND: Biomedical named entity recognition (Bio-NER) is a challenging problem because, in general, biomedical named entities of the same category (e.g., proteins and genes) do not follow one standard nomenclature. They have many irregularities and sometimes appear in ambiguous contexts. In recent years, machine-learning (ML) approaches have become increasingly common and now represent the cutting edge of Bio-NER technology. This paper addresses three problems faced by ML-based Bio-NER systems. First, most ML approaches employ singleton features that comprise one linguistic property (e.g., the current word is capitalized) and at least one class tag (e.g., B-protein, the beginning of a protein name). However, such features may be insufficient in cases where multiple properties must be considered. Adding conjunction features that contain multiple properties can be beneficial, but it would be infeasible to include all conjunction features in an NER model since memory resources are limited and some features are ineffective. To resolve this problem, we use a sequential forward search algorithm to select an effective set of features. Second, variations in the numerical parts of biomedical terms (e.g., "2" in the biomedical term IL2) cause data sparseness and generate many redundant features. To address this, we apply numerical normalization, which replaces all numerals in a term with one representative numeral to help classify named entities. Third, the assignment of NE tags does not depend solely on the target word's closest neighbors, but may depend on words outside the context window (a context window of five, for example, consists of the current word plus two preceding and two subsequent words). We use global patterns generated by the Smith-Waterman local alignment algorithm to identify such structures and modify the results of our ML-based tagger. This is called pattern-based post-processing.
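As an illustration of the numerical normalization step described above, the following is a minimal Python sketch, not the authors' implementation. It assumes that each run of digits is collapsed to the single representative numeral "1"; the abstract does not state which numeral is used or whether digits are replaced one by one, so both choices, along with the function name `normalize_numerals`, are assumptions.

```python
import re

# Matches one or more consecutive ASCII digits inside a token.
_DIGIT_RUN = re.compile(r"[0-9]+")


def normalize_numerals(token: str, representative: str = "1") -> str:
    """Replace every run of digits in a token with one representative numeral.

    The choice of "1" as the representative is an assumption made for this sketch.
    """
    return _DIGIT_RUN.sub(representative, token)


# Terms that differ only in their numerical part map to the same feature string,
# which is how normalization reduces sparse and redundant features.
for term in ["IL2", "IL12", "CD28", "kinase"]:
    print(term, "->", normalize_numerals(term))
```

With this sketch, IL2 and IL12 both become IL1, so a word feature learned from one term also applies to the other.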
RESULTS: To develop our ML-based Bio-NER system, we employ conditional random fields, which have performed effectively in several well-known tasks, as our underlying ML model. Adding selected conjunction features, applying numerical normalization, and employing pattern-based post-processing improve the F-scores by 1.67%, 1.04%, and 0.57%, respectively. The combined increase of 3.28% yields a total score of 72.98%, which exceeds that of the baseline system that uses only singleton features.
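The conjunction features mentioned in the results are chosen by a sequential forward search. The Python sketch below shows the general greedy procedure under stated assumptions: the `evaluate` callback stands in for training and scoring a tagger on a development set, and the feature-group names and gain values in the toy example are hypothetical, not taken from the paper.

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple


def sequential_forward_search(
    candidates: Iterable[str],
    evaluate: Callable[[FrozenSet[str]], float],
    min_gain: float = 0.0,
) -> Tuple[List[str], float]:
    """Greedily add the candidate that most improves the score.

    Stops when no remaining candidate improves the score by more than min_gain.
    """
    selected: List[str] = []
    remaining = list(candidates)
    best = evaluate(frozenset())  # score with no conjunction features added
    while remaining:
        # Score each remaining candidate when added to the current selection.
        scored = [(evaluate(frozenset(selected + [c])), c) for c in remaining]
        top_score, top = max(scored, key=lambda pair: pair[0])
        if top_score - best <= min_gain:
            break
        selected.append(top)
        remaining.remove(top)
        best = top_score
    return selected, best


# Toy stand-in for "train the model and measure F-score" (hypothetical gains).
gains = {"w0&w-1": 1.0, "w0&pos0": 0.5, "shape0&w+1": -0.2}


def toy_evaluate(groups: FrozenSet[str]) -> float:
    return sum(gains[g] for g in groups)


selected, best = sequential_forward_search(gains, toy_evaluate)
print(selected, best)
```

In the toy run the search keeps the two groups with positive gain and stops before adding the one that lowers the score, which mirrors the point that some conjunction features are ineffective.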
CONCLUSION: We demonstrate the benefits of using the sequential forward search algorithm to select effective conjunction feature groups. In addition, we show that numerical normalization can effectively reduce the number of redundant and unseen features. Furthermore, the Smith-Waterman local alignment algorithm can help ML-based Bio-NER deal with difficult cases that need longer context windows.
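To make the role of the Smith-Waterman local alignment algorithm concrete, here is a minimal Python sketch that aligns two tokenized sentences and returns the best-scoring local alignment. It is an illustration only: the scoring values (match 2, mismatch -1, gap -1), the use of whole tokens as alignment units, and the placeholder token "NE" for an already tagged entity are all assumptions, and the paper's actual pattern generation may differ.

```python
from typing import List, Sequence, Tuple


def smith_waterman(
    a: Sequence[str],
    b: Sequence[str],
    match: int = 2,
    mismatch: int = -1,
    gap: int = -1,
) -> Tuple[int, List[Tuple[str, str]]]:
    """Return the best local alignment score and the aligned token pairs."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    # Fill the scoring matrix; scores never drop below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    aligned: List[Tuple[str, str]] = []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            aligned.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            aligned.append((a[i - 1], "-"))  # gap in b
            i -= 1
        else:
            aligned.append(("-", b[j - 1]))  # gap in a
            j -= 1
    aligned.reverse()
    return best, aligned


pattern = "binding of NE to NE".split()
sentence = "the binding of NE protein to NE was observed".split()
score, aligned = smith_waterman(pattern, sentence)
print(score)
print(aligned)
```

In this example the alignment spans six sentence tokens and skips "protein" with a gap, so the match covers more than a five-word context window, which is the kind of longer-range structure the conclusion refers to.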

Year:  2006        PMID: 17254295      PMCID: PMC1764467          DOI: 10.1186/1471-2105-7-S5-S11

Source DB:  PubMed          Journal:  BMC Bioinformatics        ISSN: 1471-2105            Impact factor:   3.169

