
Investigating heterogeneous protein annotations toward cross-corpora utilization.

Yue Wang, Jin-Dong Kim, Rune Saetre, Sampo Pyysalo, Jun'ichi Tsujii.

Abstract

BACKGROUND: The number of corpora (collections of structured texts) has been increasing as a result of the growing interest in applying natural language processing methods to biological texts. Many named entity recognition (NER) systems have been developed based on these corpora. However, the biomedical community has not yet reached a general consensus on named entity annotation; thus, the resources are largely incompatible, and it is difficult to compare the performance of systems developed on divergently annotated resources. From a practical application perspective, however, it is desirable to use as many existing annotated resources as possible, because annotation is costly. Integrating the heterogeneous annotations in these resources is therefore a task of interest.
RESULTS: We explore the potential sources of incompatibility among the gene and protein annotations made for three common corpora: GENIA, GENETAG and AIMed. To show the inconsistency in the corpus annotations, we first tackle the incompatibility problem caused by corpus integration, and we quantitatively measure the effect of this incompatibility on protein mention recognition. We find that the F-score declines substantially when training with integrated data instead of pure data; in some cases, performance drops by nearly 12%. This degradation may be caused by the newly added heterogeneous annotations, and it cannot be fixed without an understanding of the heterogeneities that exist among the corpora. Motivated by the result of this preliminary experiment, we then qualitatively analyze a number of possible sources of these differences and investigate the factors that would explain the inconsistencies through a series of experiments. Our analyses indicate that incompatibilities in the gene/protein annotations exist mainly in four areas: the boundary annotation conventions, the scope of the entities of interest, the distribution of annotated entities, and the ratio of overlap between annotated entities. We further suggest that almost all of the incompatibilities can be prevented by properly considering these four aspects.
CONCLUSION: Our analysis covers the key similarities and dissimilarities that exist among the diverse gene/protein corpora. This paper serves to improve our understanding of the differences in the three studied corpora, which can then lead to a better understanding of the performance of protein recognizers that are based on the corpora.
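
As an illustration of why boundary annotation conventions matter for F-score, the following minimal sketch scores one set of predicted protein mentions against one set of gold mentions under exact and relaxed (overlap-based) span matching. It is not the authors' evaluation code: the sentence, the spans, and the names `overlaps` and `f_score` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def overlaps(a: Span, b: Span) -> bool:
    """Return True if the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def f_score(gold: List[Span], pred: List[Span], exact: bool = True) -> float:
    """Balanced F-score of predicted spans against gold spans.

    exact=True counts a span as correct only if both boundaries match;
    exact=False counts any overlap with a span on the other side.
    """
    if exact:
        matched_pred = sum(1 for p in pred if p in gold)
        matched_gold = sum(1 for g in gold if g in pred)
    else:
        matched_pred = sum(1 for p in pred if any(overlaps(p, g) for g in gold))
        matched_gold = sum(1 for g in gold if any(overlaps(g, p) for p in pred))
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentence: "the human IL-2 receptor gene activates NF-kappa B"
# Assumed convention A (gold) includes the species modifier: "human IL-2 receptor"
# Assumed convention B (prediction) excludes it: "IL-2 receptor"
gold = [(4, 23), (39, 49)]
pred = [(10, 23), (39, 49)]

exact_f = f_score(gold, pred, exact=True)     # 0.5: only "NF-kappa B" matches exactly
relaxed_f = f_score(gold, pred, exact=False)  # 1.0: both mentions overlap a gold span
print(exact_f, relaxed_f)
```

In this toy case a single difference in boundary convention halves the exact-match F-score while leaving the relaxed score unchanged, which is the kind of effect that can appear when corpora annotated under different conventions are mixed for training or evaluation.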


Year: 2009    PMID: 19995463    PMCID: PMC2804683    DOI: 10.1186/1471-2105-10-403

Source DB: PubMed    Journal: BMC Bioinformatics    ISSN: 1471-2105    Impact factor: 3.169


