Recognizing names in biomedical texts: a machine learning approach.

GuoDong Zhou, Jie Zhang, Jian Su, Dan Shen, ChewLim Tan.

Abstract

MOTIVATION: With an overwhelming amount of textual information in molecular biology and biomedicine, there is a need for effective and efficient literature mining and knowledge discovery that can help biologists to gather and make use of the knowledge encoded in text documents. In order to make organized and structured information available, automatically recognizing biomedical entity names becomes critical and is important for information retrieval, information extraction and automated knowledge acquisition.
RESULTS: In this paper, we present a named entity recognition system in the biomedical domain, called PowerBioNE. In order to deal with the special phenomena of naming conventions in the biomedical domain, we propose various evidential features: (1) word formation pattern; (2) morphological pattern, such as prefix and suffix; (3) part-of-speech; (4) head noun trigger; (5) special verb trigger and (6) name alias feature. All the features are integrated effectively and efficiently through a hidden Markov model (HMM) and an HMM-based named entity recognizer. In addition, a k-Nearest Neighbor (k-NN) algorithm is proposed to resolve the data sparseness problem in our system. Finally, we present a pattern-based post-processing step that automatically extracts rules from the training data to deal with the cascaded entity name phenomenon. To the best of our knowledge, PowerBioNE is the first system that deals with the cascaded entity name phenomenon. Evaluation shows that our system achieves F-measures of 66.6 and 62.2 on the 23 classes of GENIA V3.0 and V1.1, respectively. In particular, our system achieves an F-measure of 75.8 on the "protein" class of GENIA V3.0. For comparison, our system outperforms the best published result by 7.8 on GENIA V1.1, without the help of any dictionaries. The evaluation also shows that our HMM and the k-NN algorithm outperform other models, such as back-off HMM, linear interpolated HMM, support vector machines, C4.5, C4.5 rules and RIPPER, by effectively capturing the local context dependency and resolving the data sparseness problem. Moreover, evaluation on GENIA V3.0 shows that the post-processing for the cascaded entity name phenomenon improves the F-measure by 3.9. Finally, error analysis shows that about half of the errors are caused by the strict annotation scheme and the annotation inconsistency in the GENIA corpus.
This suggests that, once these annotation-related errors are discounted, our system achieves an acceptable F-measure of 83.6 on the 23 classes of GENIA V3.0 and, in particular, 86.2 on the "protein" class, without the help of any dictionaries. We believe that an F-measure of 90 on the 23 classes of GENIA V3.0 and, in particular, 92 on the "protein" class can be achieved by refining the annotation scheme of the GENIA corpus (for example, a more flexible annotation scheme and greater annotation consistency) and by including a reasonable biomedical dictionary.
AVAILABILITY: A demo system is available at http://textmining.i2r.a-star.edu.sg/NLS/demo.htm. A technology license is available upon bilateral agreement.
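
To make the first two evidential features listed under RESULTS more concrete, the following is a minimal sketch of how a word formation pattern and prefix/suffix features might be computed per token. It is not PowerBioNE's implementation: the abstract does not give the pattern inventory or the affix lengths, so the class names (`AlphaDigitMix`, `MixedCase`, and so on), the function names and the affix length of 3 are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def word_formation_pattern(token: str) -> str:
    """Map a token to a coarse orthographic class (hypothetical inventory)."""
    has_alpha = any(c.isalpha() for c in token)
    has_digit = any(c.isdigit() for c in token)
    if token.isdigit():
        return "Digits"
    if has_alpha and has_digit:
        return "AlphaDigitMix"      # e.g. "IL-2", "p53"
    if "-" in token:
        return "Hyphenated"         # e.g. "NF-kappaB"
    if token.isalpha() and token.isupper():
        return "AllCaps"            # e.g. "DNA"
    if token.isalpha() and token[0].isupper() and token[1:].islower():
        return "InitialCap"         # e.g. "Protein"
    if token.isalpha() and token.islower():
        return "LowerCase"          # e.g. "kinase"
    if token.isalpha():
        return "MixedCase"          # e.g. "mRNA"
    return "Other"                  # punctuation, empty string, etc.


def affixes(token: str, n: int = 3) -> Tuple[str, str]:
    """Return the lowercased prefix and suffix of length n (assumed length)."""
    lowered = token.lower()
    return lowered[:n], lowered[-n:]


def extract_features(tokens: List[str]) -> List[Dict[str, str]]:
    """Build one feature dictionary per token."""
    features = []
    for token in tokens:
        prefix, suffix = affixes(token)
        features.append({
            "token": token,
            "pattern": word_formation_pattern(token),
            "prefix": prefix,
            "suffix": suffix,
        })
    return features


if __name__ == "__main__":
    for row in extract_features(["IL-2", "gene", "expression", "mRNA"]):
        print(row)
```

In a sequence model such as an HMM, feature dictionaries like these would typically be combined into the observation at each position; how PowerBioNE does this is not described in the abstract.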

Year: 2004    PMID: 14871877    DOI: 10.1093/bioinformatics/bth060

Source DB: PubMed    Journal: Bioinformatics    ISSN: 1367-4803    Impact factor: 6.937


  34 in total

1.  MachineProse: an ontological framework for scientific assertions.

Authors:  Deendayal Dinakarpandian; Yugyung Lee; Kartik Vishwanath; Rohini Lingambhotla
Journal:  J Am Med Inform Assoc       Date:  2005-12-15       Impact factor: 4.497

2.  NERBio: using selected word conjunctions, term normalization, and global patterns to improve biomedical named entity recognition.

Authors:  Richard Tzong-Han Tsai; Cheng-Lung Sung; Hong-Jie Dai; Hsieh-Chuan Hung; Ting-Yi Sung; Wen-Lian Hsu
Journal:  BMC Bioinformatics       Date:  2006-12-18       Impact factor: 3.169

3.  Evaluating the state-of-the-art in automatic de-identification.

Authors:  Ozlem Uzuner; Yuan Luo; Peter Szolovits
Journal:  J Am Med Inform Assoc       Date:  2007-06-28       Impact factor: 4.497

Review 4.  Network integration and graph analysis in mammalian molecular systems biology.

Authors:  A Ma'ayan
Journal:  IET Syst Biol       Date:  2008-09       Impact factor: 1.615

Review 5.  Recent progress in automatically extracting information from the pharmacogenomic literature.

Authors:  Yael Garten; Adrien Coulet; Russ B Altman
Journal:  Pharmacogenomics       Date:  2010-10       Impact factor: 2.533

6.  A study of machine-learning-based approaches to extract clinical entities and their assertions from discharge summaries.

Authors:  Min Jiang; Yukun Chen; Mei Liu; S Trent Rosenbloom; Subramani Mani; Joshua C Denny; Hua Xu
Journal:  J Am Med Inform Assoc       Date:  2011-04-20       Impact factor: 4.497

7.  A flexible framework for deriving assertions from electronic medical records.

Authors:  Kirk Roberts; Sanda M Harabagiu
Journal:  J Am Med Inform Assoc       Date:  2011-07-01       Impact factor: 4.497

8.  Information extraction approaches to unconventional data sources for "Injury Surveillance System": the case of newspapers clippings.

Authors:  Paola Berchialla; Cecilia Scarinzi; Silvia Snidero; Yousif Rahim; Dario Gregori
Journal:  J Med Syst       Date:  2010-04-27       Impact factor: 4.460

9.  Gimli: open source and high-performance biomedical name recognition.

Authors:  David Campos; Sérgio Matos; José Luís Oliveira
Journal:  BMC Bioinformatics       Date:  2013-02-15       Impact factor: 3.169

10.  Combined SVM-CRFs for biological named entity recognition with maximal bidirectional squeezing.

Authors:  Fei Zhu; Bairong Shen
Journal:  PLoS One       Date:  2012-06-26       Impact factor: 3.240
