Investigating heterogeneous protein annotations toward cross-corpora utilization.

Yue Wang, Jin-Dong Kim, Rune Saetre, Sampo Pyysalo, Jun'ichi Tsujii.

Abstract

BACKGROUND: The number of corpora (collections of structured texts) has been increasing as a result of the growing interest in applying natural language processing methods to biological texts. Many named entity recognition (NER) systems have been developed based on these corpora. However, the biomedical community has as yet no general consensus regarding named entity annotation; thus, the resources are largely incompatible, and it is difficult to compare the performance of systems developed on divergently annotated resources. On the other hand, from a practical application perspective, it is desirable to utilize as many existing annotated resources as possible, because annotation is costly. Integrating the heterogeneous annotations in these resources therefore becomes a task of interest.
RESULTS: We explore the potential sources of incompatibility among the gene and protein annotations made for three common corpora: GENIA, GENETAG and AIMed. To show the inconsistency in the corpora annotations, we first tackle the incompatibility problem caused by corpus integration, and we quantitatively measure the effect of this incompatibility on protein mention recognition. We find that the F-score declines substantially when training with integrated data instead of pure data; in some cases, it drops by nearly 12%. This degradation may be caused by the newly added heterogeneous annotations, and it cannot be fixed without an understanding of the heterogeneities that exist among the corpora. Motivated by the result of this preliminary experiment, we further qualitatively analyze a number of possible sources for these differences and investigate the factors that would explain the inconsistencies by performing a series of well-designed experiments. Our analyses indicate that incompatibilities in the gene/protein annotations exist mainly in the following four areas: the boundary annotation conventions, the scope of the entities of interest, the distribution of annotated entities, and the ratio of overlap between annotated entities. We further suggest that almost all of the incompatibilities can be prevented by properly considering these four aspects.
CONCLUSION: Our analysis covers the key similarities and dissimilarities that exist among the diverse gene/protein corpora. This paper serves to improve our understanding of the differences in the three studied corpora, which can then lead to a better understanding of the performance of protein recognizers that are based on the corpora.
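
To illustrate why differing boundary annotation conventions can depress the F-score, the following is a minimal sketch and not the authors' evaluation code. The abstract does not state the matching criterion used, so both strict (exact-boundary) and relaxed (any-overlap) matching are shown. The example sentence, the two annotation conventions, and all function and variable names are hypothetical.

```python
from typing import Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def overlaps(a: Span, b: Span) -> bool:
    """Return True if two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def f_score(gold: Set[Span], pred: Set[Span], relaxed: bool = False) -> float:
    """F1 of predicted mentions against gold mentions.

    Strict mode counts a prediction only if its boundaries equal a gold span.
    Relaxed mode counts any overlap with a gold span.
    """
    if relaxed:
        tp_pred = sum(any(overlaps(p, g) for g in gold) for p in pred)
        tp_gold = sum(any(overlaps(g, p) for p in pred) for g in gold)
    else:
        tp_pred = tp_gold = len(gold & pred)
    precision = tp_pred / len(pred) if pred else 0.0
    recall = tp_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentence used only for illustration.
text = "human p53 protein binds MDM2"

# Hypothetical convention A: include the species modifier and the head noun.
gold_a = {(0, 17), (24, 28)}  # "human p53 protein", "MDM2"

# Hypothetical convention B: annotate the minimal name only.
pred_b = {(6, 9), (24, 28)}   # "p53", "MDM2"

strict = f_score(gold_a, pred_b)                 # only "MDM2" matches exactly
relaxed = f_score(gold_a, pred_b, relaxed=True)  # "p53" overlaps the longer span
print(f"strict F1 = {strict:.2f}, relaxed F1 = {relaxed:.2f}")
```

In this toy case a tagger that follows convention B is penalized under strict matching against convention A even though it locates both proteins, which is one way a boundary mismatch between corpora could surface as a lower F-score.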

Substances: Proteins

Year: 2009 PMID: 19995463 PMCID: PMC2804683 DOI: 10.1186/1471-2105-10-403

Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169

13 in total

1. Disambiguating proteins, genes, and RNA in text: a machine learning approach.

Authors: V Hatzivassiloglou; P A Duboué; A Rzhetsky
Journal: Bioinformatics Date: 2001 Impact factor: 6.937

2. Protein names and how to find them.

Authors: Kristofer Franzén; Gunnar Eriksson; Fredrik Olsson; Lars Asker; Per Lidén; Joakim Cöster
Journal: Int J Med Inform Date: 2002-12-04 Impact factor: 4.046

3. GENIA corpus--semantically annotated corpus for bio-textmining.

Authors: J-D Kim; T Ohta; Y Tateisi; J Tsujii
Journal: Bioinformatics Date: 2003 Impact factor: 6.937

4. Mining the biomedical literature in the genomic era: an overview. (Review)

Authors: Hagit Shatkay; Ronen Feldman
Journal: J Comput Biol Date: 2003 Impact factor: 1.479

5. A survey of current work in biomedical text mining. (Review)

Authors: Aaron M Cohen; William R Hersh
Journal: Brief Bioinform Date: 2005-03 Impact factor: 11.622

6. POSBIOTM-NER: a trainable biomedical named-entity recognition system.

Authors: Yu Song; Eunju Kim; Gary Geunbae Lee; Byoung-Kee Yi
Journal: Bioinformatics Date: 2005-04-06 Impact factor: 6.937

7. Various criteria in the evaluation of biomedical named entity recognition.

Authors: Richard Tzong-Han Tsai; Shih-Hung Wu; Wen-Chi Chou; Yu-Chun Lin; Ding He; Jieh Hsiang; Ting-Yi Sung; Wen-Lian Hsu
Journal: BMC Bioinformatics Date: 2006-02-24 Impact factor: 3.169

8. GENETAG: a tagged corpus for gene/protein named entity recognition.

Authors: Lorraine Tanabe; Natalie Xie; Lynne H Thom; Wayne Matten; W John Wilbur
Journal: BMC Bioinformatics Date: 2005-05-24 Impact factor: 3.169

9. Comparative analysis of five protein-protein interaction corpora.

Authors: Sampo Pyysalo; Antti Airola; Juho Heimonen; Jari Björne; Filip Ginter; Tapio Salakoski
Journal: BMC Bioinformatics Date: 2008-04-11 Impact factor: 3.169

10. Corpus refactoring: a feasibility study.

Authors: Helen L Johnson; William A Baumgartner; Martin Krallinger; K Bretonnel Cohen; Lawrence Hunter
Journal: J Biomed Discov Collab Date: 2007-09-13

12 in total

1. The synaptoneurosome transcriptome: a model for profiling the molecular effects of alcohol.

Authors: D Most; L Ferguson; Y Blednov; R D Mayfield; R A Harris
Journal: Pharmacogenomics J Date: 2014-08-19 Impact factor: 3.550

2. Complex event extraction at PubMed scale.

Authors: Jari Björne; Filip Ginter; Sampo Pyysalo; Jun'ichi Tsujii; Tapio Salakoski
Journal: Bioinformatics Date: 2010-06-15 Impact factor: 6.937

3. Pooling annotated corpora for clinical concept extraction.

Authors: Kavishwar B Wagholikar; Manabu Torii; Siddhartha R Jonnalagadda; Hongfang Liu
Journal: J Biomed Semantics Date: 2013-01-08

4. An analysis of gene/protein associations at PubMed scale.

Authors: Sampo Pyysalo; Tomoko Ohta; Jun'ichi Tsujii
Journal: J Biomed Semantics Date: 2011-10-06

5. Event extraction for DNA methylation.

Authors: Tomoko Ohta; Sampo Pyysalo; Makoto Miwa; Jun'ichi Tsujii
Journal: J Biomed Semantics Date: 2011-10-06

6. Overview of the ID, EPI and REL tasks of BioNLP Shared Task 2011.

Authors: Sampo Pyysalo; Tomoko Ohta; Rafal Rak; Dan Sullivan; Chunhong Mao; Chunxia Wang; Bruno Sobral; Jun'ichi Tsujii; Sophia Ananiadou
Journal: BMC Bioinformatics Date: 2012-06-26 Impact factor: 3.169

7. Wide coverage biomedical event extraction using multiple partially overlapping corpora.

Authors: Makoto Miwa; Sampo Pyysalo; Tomoko Ohta; Sophia Ananiadou
Journal: BMC Bioinformatics Date: 2013-06-03 Impact factor: 3.169

8. Using empirically constructed lexical resources for named entity recognition.

Authors: Siddhartha Jonnalagadda; Trevor Cohen; Stephen Wu; Hongfang Liu; Graciela Gonzalez
Journal: Biomed Inform Insights Date: 2013-06-24

9. BioC: a minimalist approach to interoperability for biomedical text processing.

Authors: Donald C Comeau; Rezarta Islamaj Doğan; Paolo Ciccarese; Kevin Bretonnel Cohen; Martin Krallinger; Florian Leitner; Zhiyong Lu; Yifan Peng; Fabio Rinaldi; Manabu Torii; Alfonso Valencia; Karin Verspoor; Thomas C Wiegers; Cathy H Wu; W John Wilbur
Journal: Database (Oxford) Date: 2013-09-18 Impact factor: 3.451

10. Biomedical named entity extraction: some issues of corpus compatibilities.

Authors: Asif Ekbal; Sriparna Saha; Utpal Kumar Sikdar
Journal: Springerplus Date: 2013-11-12