Efficient extraction of protein-protein interactions from full-text articles.

Jörg Hakenberg, Robert Leaman, Nguyen Ha Vo, Siddhartha Jonnalagadda, Ryan Sullivan, Christopher Miller, Luis Tari, Chitta Baral, Graciela Gonzalez.

Abstract

Proteins and their interactions govern virtually all cellular processes, such as regulation, signaling, metabolism, and structure. Most experimental findings pertaining to such interactions are discussed in research papers, which, in turn, get curated by protein interaction databases. Authors, editors, and publishers benefit from efforts to alleviate the tasks of searching for relevant papers, evidence for physical interactions, and proper identifiers for each protein involved. The BioCreative II.5 community challenge addressed these tasks in a competition-style assessment to evaluate and compare different methodologies, to raise awareness of the increasing accuracy of automated methods, and to guide future implementations. In this paper, we present our approaches for protein named entity recognition, including normalization, and for extraction of protein-protein interactions from full text. Our overall goal is to identify efficient individual components, and we compare various compositions that handle a single full-text article in between 10 seconds and 2 minutes. We propose strategies to transfer document-level annotations to the sentence level, which allows for the creation of a more fine-grained training corpus; we use this corpus to automatically derive around 5,000 patterns. We rank sentences by relevance to the task of finding novel interactions with physical evidence, using a sentence classifier built from this training corpus. Heuristics for paraphrasing sentences help to further remove unnecessary information that might interfere with patterns, such as additional adjectives, clauses, or bracketed expressions. In BioCreative II.5, we achieved an F-score of 22 percent for finding protein interactions, and 43 percent for mapping proteins to UniProt IDs; disregarding species, F-scores are 30 percent and 55 percent, respectively. On average, our best-performing setup required around 2 minutes per full text.
All data and pattern sets, as well as Java classes that extend third-party software, are available as supplementary information (see Appendix).

Year:  2010        PMID: 20498514     DOI: 10.1109/TCBB.2010.51

Source DB:  PubMed          Journal:  IEEE/ACM Trans Comput Biol Bioinform        ISSN: 1545-5963            Impact factor:   3.710

