
GeneReporter--sequence-based document retrieval and annotation.

Annekathrin Bartsch, Boyke Bunk, Isam Haddad, Johannes Klein, Richard Münch, Thorsten Johl, Uwe Kärst, Lothar Jänsch, Dieter Jahn, Ida Retter.

Abstract

GeneReporter is a web tool that reports functional information and relevant literature on a protein-coding sequence of interest. Its purpose is to support both manual genome annotation and document retrieval. PubMed references corresponding to a sequence are detected by the extraction of query words from UniProt entries of homologous sequences. Data on protein families, domains, potential cofactors, structure, function, cellular localization, metabolic contribution and corresponding DNA binding sites complement the information on a given gene product of interest.
AVAILABILITY AND IMPLEMENTATION: GeneReporter is available at http://www.genereporter.tu-bs.de. The web site integrates databases and analysis tools as SOAP-based web services from the EBI (European Bioinformatics Institute) and NCBI (National Center for Biotechnology Information).

Year: 2011 PMID: 21310745 PMCID: PMC3065684 DOI: 10.1093/bioinformatics/btr047

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

In the face of next-generation sequencing and high-throughput analyses, the link between newly obtained data and existing knowledge is crucial. Automatic annotation pipelines provide useful evidence of potential functions for genes and proteins, but in a final, essential step the scientist must manually evaluate the available information. Usually, the necessary evidence is derived from scientific publications, databases and in silico predictions. Tools that combine all of these data for a gene or protein of interest are therefore of high practical value. In this context, GeneReporter offers a customizable workflow for the integrated application of protein sequence analysis and document retrieval. A large number of diverse text-mining tools exist that provide different strategies and interfaces to satisfy the extensive data-mining demands of the biomedical sciences (Krallinger ). GeneReporter identifies citations related to a gene or protein sequence of interest. The UniProt annotations of homologous sequences are used to derive keywords such as gene names, synonyms and species. These keywords provide the query terms for a subsequent literature search in PubMed (Sayers ). In this way, GeneReporter extends and replaces MineBlast (Dieterich ), a similar tool that has been discontinued. In comparison with other tools that connect literature to sequence information, such as quickLit (Gilchrist ) and Metis (Mitchell ), GeneReporter is characterized by highly customizable query options, the integration of InterPro and direct access to the original EBI and NCBI databases.

2 REQUESTING LITERATURE AND SEQUENCE ANALYSIS

The user can enter up to 10 nucleotide or protein sequences to submit a query on the GeneReporter web site. Two types of analysis are provided: (i) homology-based document retrieval, which searches for information on homologous sequences in the UniProt Knowledgebase (UniProt Consortium, 2010) and for citations in PubMed; and (ii) analysis of the protein sequences, which requests protein annotations from InterPro (Hunter ), Phobius (Käll ) and PrediSi (Hiller ). The complete workflow is depicted in Figure 1. An example application is given as Supplementary Material.
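The input step described above can be illustrated with a minimal sketch. It assumes FASTA-formatted input and a simple letter-composition heuristic for telling nucleotide from protein sequences; neither detail is stated in the text, and the function names are hypothetical. Only the limit of 10 sequences comes from the description above.

```python
MAX_SEQUENCES = 10  # upper limit per submission, as stated in the text


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def guess_type(sequence):
    """Assumed heuristic: at least 90% A/C/G/T/U/N letters means nucleotide."""
    if not sequence:
        raise ValueError("empty sequence")
    nucleotide_letters = sum(sequence.count(c) for c in "ACGTUN")
    if nucleotide_letters / len(sequence) >= 0.9:
        return "nucleotide"
    return "protein"


def validate_submission(text):
    """Parse a submission and enforce the per-query sequence limit."""
    records = parse_fasta(text)
    if not records:
        raise ValueError("no FASTA records found")
    if len(records) > MAX_SEQUENCES:
        raise ValueError(f"at most {MAX_SEQUENCES} sequences per query")
    return [(header, seq, guess_type(seq)) for header, seq in records]
```

A call such as `validate_submission(">a\nACGTACGT\n>b\nMKVLAAGIVG\n")` returns one tuple per sequence with its guessed type, and a submission with more than 10 records is rejected before any remote service is contacted.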

Fig. 1.

Workflow of the GeneReporter analysis process. Arrows indicate data transfer and processing. Input and output are depicted as rectangles; web services are depicted as rounded rectangles.

Using homology-based document retrieval, the first step is a BLAST search in UniProtKB, for which the user can select the desired algorithm. NCBI-BLAST (Altschul ) and WU-BLAST (Lopez ) rank homology matches differently and therefore yield different query word extractions from the respective UniProtKB entries. PSI-BLAST (Altschul ) is the most sensitive algorithm and is beneficial for sequences that fail to produce significant hits with the other algorithms. Either Swiss-Prot or the complete UniProtKB can be chosen as the BLAST target database. The UniProtKB entries of the resulting BLAST hits are parsed for gene names, synonyms and species names, which are used as query terms for the subsequent PubMed request. This literature search can be further specified, e.g. by additional query terms and years of publication. The option ‘organism-specific search’ adds the respective species name to the PubMed search string. Query and result options and the construction of the PubMed queries are described in detail in the Supplementary Material.

For further analysis of the protein sequence, GeneReporter submits a query to InterProScan, which matches the sequence against InterPro. This database comprises predictive signatures that assign protein families, various domains and functional sites to a protein of interest. The input sequence can also be analysed by Phobius and PrediSi, which search for putative transmembrane regions and signal peptides.

To ensure that datasets and analysis tools remain up to date in the long term, GeneReporter utilizes standardized web services from the EBI (Goujon ), the NCBI (Sayers ) and our institute. The processing time of a query strongly depends on these services.
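
The step from extracted keywords to a PubMed search string can be sketched as follows. The actual query construction is documented only in the Supplementary Material, so the combination rule here (names and synonyms joined with OR, then AND with species, additional terms and a year range) is an assumption, as are the function names and the example gene names `geneX` and `GX1`. The `[tiab]` and `[dp]` suffixes are standard PubMed field tags for title/abstract and publication date.

```python
def quote(term):
    """Wrap a term in double quotes so PubMed treats it as a phrase."""
    return '"' + term.replace('"', "") + '"'


def build_pubmed_query(gene_names, species=None, extra_terms=(),
                       year_from=None, year_to=None):
    """Assemble a PubMed search string from keywords (assumed scheme)."""
    # Drop empty strings and duplicates while preserving the input order.
    names = list(dict.fromkeys(n.strip() for n in gene_names if n.strip()))
    if not names:
        raise ValueError("at least one gene name or synonym is required")

    # Gene names and synonyms are alternatives, so they are joined with OR.
    parts = ["(" + " OR ".join(f"{quote(n)}[tiab]" for n in names) + ")"]

    # Assumed equivalent of the 'organism-specific search' option.
    if species:
        parts.append(f"{quote(species)}[tiab]")

    # User-supplied additional query terms narrow the search further.
    for term in extra_terms:
        parts.append(f"{quote(term)}[tiab]")

    # Optional publication year range; open ends get wide placeholder years.
    if year_from is not None or year_to is not None:
        start = year_from if year_from is not None else 1000
        end = year_to if year_to is not None else 3000
        parts.append(f"{start}:{end}[dp]")

    return " AND ".join(parts)


query = build_pubmed_query(["geneX", "GX1", "geneX"],
                           species="Pseudomonas aeruginosa",
                           year_from=2005, year_to=2010)
print(query)
```

Because different BLAST algorithms rank hits differently, the list passed as `gene_names` would differ between runs, which is one way to picture why the choice of algorithm changes the retrieved literature.
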
The web service providers tie access to their services to certain rules in order to avoid overload and abuse of their resources. To comply with these rules, a local queuing system monitors and limits the number of simultaneous queries. Details on cut-offs and limits are provided as Supplementary Material.
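One common way to cap the number of simultaneous requests is a counting semaphore. The sketch below illustrates that general technique only; it is not a description of GeneReporter's queuing system, whose limits are given in the Supplementary Material. The class and function names, the limit of 2 and the simulated request are all hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class LimitedQueue:
    """Run jobs while never exceeding a fixed number of simultaneous calls."""

    def __init__(self, max_simultaneous):
        self._slots = threading.BoundedSemaphore(max_simultaneous)
        self._lock = threading.Lock()
        self.running = 0  # jobs currently holding a slot
        self.peak = 0     # highest concurrency observed so far

    def run(self, job, *args):
        # Blocks until a slot is free, then executes the job.
        with self._slots:
            with self._lock:
                self.running += 1
                self.peak = max(self.peak, self.running)
            try:
                return job(*args)
            finally:
                with self._lock:
                    self.running -= 1


def fake_remote_call(seconds):
    """Stand-in for a slow web service request (hypothetical)."""
    time.sleep(seconds)
    return seconds


def run_batch(n_jobs, max_simultaneous):
    """Submit n_jobs stand-in requests; return (jobs finished, peak concurrency)."""
    queue = LimitedQueue(max_simultaneous)
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        futures = [pool.submit(queue.run, fake_remote_call, 0.01)
                   for _ in range(n_jobs)]
        results = [future.result() for future in futures]
    return len(results), queue.peak
```

In this sketch, `run_batch(8, 2)` starts eight worker threads, but the semaphore lets at most two of them call the stand-in service at any moment; the remaining jobs wait for a free slot.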

3 RESULTS

The results are summarized on an overview page. For each query sequence, this page provides a link to a detailed view of the data obtained from the requested services. The result overview page can be bookmarked, and results can be retrieved from this URL for at least 24 h. For further analysis, results can be downloaded as Excel or tab-delimited text files.

The detailed view provides one result tab for each requested service. The BLAST result tab shows homologous protein sequences. It is complemented with annotations from the UniProt database, e.g. organism name and GO terms, in order to facilitate their evaluation. The PubMed result tab shows gene-related citations ordered by the respective PubMed queries. Query words that were matched within title and abstract are marked in bold. For each query word combination, the link ‘This query in PubMed’ performs the corresponding query on the PubMed web site. This allows the automatically generated queries to be modified and refined manually with all the features of the PubMed search interface. Furthermore, GeneReporter provides citations from the UniProt entries of the BLAST hit sequences. In general, these references comprise the key papers on the respective gene or protein. Figure 2 shows the PubMed tab of an example search for a hypothetical protein from Pseudomonas aeruginosa C3719.
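The bold marking of matched query words could be produced along the following lines. This is a minimal sketch, assuming case-insensitive whole-word matching and HTML `<b>` tags for the bold style; the matching rules actually used by GeneReporter are not described here, and the function name and the example gene name `geneX` are hypothetical.

```python
import html
import re


def highlight(text, query_words):
    """Return HTML-escaped text with matched query words wrapped in <b> tags."""
    escaped_text = html.escape(text)
    # Longest words first, so multi-word phrases win over shorter overlaps.
    words = sorted({w for w in query_words if w}, key=len, reverse=True)
    if not words:
        return escaped_text
    alternatives = "|".join(re.escape(html.escape(w)) for w in words)
    # Lookarounds ensure a match is not embedded inside a longer word.
    pattern = re.compile(r"(?<!\w)(" + alternatives + r")(?!\w)", re.IGNORECASE)
    return pattern.sub(r"<b>\1</b>", escaped_text)


title = "The geneX gene of Pseudomonas aeruginosa"
print(highlight(title, ["geneX", "Pseudomonas aeruginosa"]))
```

Escaping the text before inserting the tags keeps characters such as `<` in a title or abstract from being interpreted as markup.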

Fig. 2.

Screenshot of the homology-based document retrieval result. Query sequence in this example: UniProt AcNo A3KZR4.

The output from InterPro, Phobius and PrediSi requests is given in additional tabs. The InterProScan and Phobius output includes graphical visualizations of signature matches and transmembrane regions within the proteins of interest.

REFERENCES

1. PrediSi: prediction of signal peptides and their cleavage positions.

Authors: Karsten Hiller; Andreas Grote; Maurice Scheer; Richard Münch; Dieter Jahn
Journal: Nucleic Acids Res Date: 2004-07-01 Impact factor: 16.971

2. METIS: multiple extraction techniques for informative sentences.

Authors: A L Mitchell; A Divoli; J-H Kim; M Hilario; I Selimas; T K Attwood
Journal: Bioinformatics Date: 2005-09-13 Impact factor: 6.937

3. MineBlast: a literature presentation service supporting protein annotation by data mining of BLAST results.

Authors: Guido Dieterich; Uwe Kärst; Jürgen Wehland; Lothar Jänsch
Journal: Bioinformatics Date: 2005-06-07 Impact factor: 6.937

4. Analysis of biological processes and diseases using text mining approaches.

Authors: Martin Krallinger; Florian Leitner; Alfonso Valencia
Journal: Methods Mol Biol Date: 2010

5. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Authors: S F Altschul; T L Madden; A A Schäffer; J Zhang; Z Zhang; W Miller; D J Lipman
Journal: Nucleic Acids Res Date: 1997-09-01 Impact factor: 16.971

6. A new bioinformatics analysis tools framework at EMBL-EBI.

Authors: Mickael Goujon; Hamish McWilliam; Weizhong Li; Franck Valentin; Silvano Squizzato; Juri Paern; Rodrigo Lopez
Journal: Nucleic Acids Res Date: 2010-05-03 Impact factor: 16.971

7. WU-Blast2 server at the European Bioinformatics Institute.

Authors: Rodrigo Lopez; Ville Silventoinen; Stephen Robinson; Asif Kibria; Warren Gish
Journal: Nucleic Acids Res Date: 2003-07-01 Impact factor: 16.971

8. The Universal Protein Resource (UniProt) in 2010.

Authors: UniProt Consortium
Journal: Nucleic Acids Res Date: 2009-10-20 Impact factor: 16.971

9. InterPro: the integrative protein signature database.

Authors: Sarah Hunter; Rolf Apweiler; Teresa K Attwood; Amos Bairoch; Alex Bateman; David Binns; Peer Bork; Ujjwal Das; Louise Daugherty; Lauranne Duquenne; Robert D Finn; Julian Gough; Daniel Haft; Nicolas Hulo; Daniel Kahn; Elizabeth Kelly; Aurélie Laugraud; Ivica Letunic; David Lonsdale; Rodrigo Lopez; Martin Madera; John Maslen; Craig McAnulla; Jennifer McDowall; Jaina Mistry; Alex Mitchell; Nicola Mulder; Darren Natale; Christine Orengo; Antony F Quinn; Jeremy D Selengut; Christian J A Sigrist; Manjula Thimma; Paul D Thomas; Franck Valentin; Derek Wilson; Cathy H Wu; Corin Yeats
Journal: Nucleic Acids Res Date: 2008-10-21 Impact factor: 16.971

10. Evading the annotation bottleneck: using sequence similarity to search non-sequence gene data.

Authors: Michael J Gilchrist; Mikkel B Christensen; Richard Harland; Nicolas Pollet; James C Smith; Naoto Ueno; Nancy Papalopulu
Journal: BMC Bioinformatics Date: 2008-10-17 Impact factor: 3.169
