
SFannotation: A Simple and Fast Protein Function Annotation System.

Abstract

Cost-effective, high-throughput sequencing technologies and improved computational approaches have generated vast amounts of sequencing data, from which many putative proteins have been discovered after assembly and structural annotation. Putative proteins are typically annotated using a functional annotation system that relies on extant databases, but the sheer size of these databases often creates a bottleneck for rapid functional annotation. We developed SFannotation, a simple and fast functional annotation system that rapidly annotates putative proteins against four extant databases (Swiss-Prot, TIGRFAMs, Pfam, and the non-redundant sequence database) by using a best-hit approach with BLASTP and HMMSEARCH.


Keywords: bioinformatics; gene product; protein annotation

Year: 2014 PMID: 25031571 PMCID: PMC4099352 DOI: 10.5808/GI.2014.12.2.76

Source DB: PubMed Journal: Genomics Inform ISSN: 1598-866X

Introduction

Functional annotation of putative proteins is a fundamental practice in the postgenomics era [1]; it allows us to analyze genomic and genetic features, such as physiological activity and metabolism, and to discover medically and industrially relevant enzymes. Because high-throughput sequencing technologies, including those of the next and third generations, have produced vast amounts of sequencing data from which large numbers of putative proteins have been discovered, many automated functional annotation systems have been developed to annotate these proteins with minimal manual effort [2]. However, the runtime of functional annotation against large extant databases is often a bottleneck; standalone tools in particular, such as AutoFACT [3] and BLANNOTATOR [4], require users to provide high-performance hardware for fast annotation. From the user's perspective, a web-based annotation server is a useful way to bypass the need for high-performance computing resources, and such servers also offer user-friendly interfaces. The RAST server is particularly popular and can rapidly annotate many microbial proteins against a specially curated subsystem database [5]. Web servers, however, may be undesirable because of critical obstacles, such as limited server resources, long waiting times when many queries are queued, low-bandwidth or unstable network connections for uploading query data and downloading outputs, and data security problems. Thus, some users prefer standalone systems to web-based systems despite the demand for high-performance resources. Although standalone and web-based systems each have strengths and weaknesses, neither can avoid slow runtimes as database sizes increase exponentially unless some aspect of the annotation workflow is controlled.
We developed SFannotation, which rapidly annotates putative proteins by using a single or bidirectional best-hit approach with the sequence-based methods BLASTP [6] and HMMSEARCH [7] against large extant databases: Swiss-Prot [8], TIGRFAMs [9], Pfam [10], and the non-redundant sequence database (NR) of NCBI [11]. Because best-hit approaches, especially the bidirectional best hit [12], have been widely used both to search for reliable homologous protein sequences, such as orthologs, and in functional annotation systems [13,14,15,16], SFannotation can reliably annotate putative proteins. Notably, SFannotation can rapidly annotate proteins against large extant databases by means of its hierarchical workflow.
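
The bidirectional best-hit criterion can be illustrated with a minimal Python sketch. It is not taken from the SFannotation source (which is written in Perl and bash); the function name and the two input dictionaries are hypothetical and stand for best hits parsed from a forward search (queries against the database) and a reverse search (database proteins against the queries).

```python
def bidirectional_best_hits(forward, reverse):
    """Return (query, subject) pairs that are each other's best hit.

    forward: dict mapping each query protein to its best hit in the database.
    reverse: dict mapping each database protein to its best hit among the queries.
    Both dicts are hypothetical inputs, e.g. parsed from two separate searches.
    """
    return sorted((q, s) for q, s in forward.items() if reverse.get(s) == q)


# Toy identifiers, for illustration only.
forward = {"q1": "dbA", "q2": "dbB", "q3": "dbA"}
reverse = {"dbA": "q1", "dbB": "q9"}

print(bidirectional_best_hits(forward, reverse))  # [('q1', 'dbA')]
```

Only q1 is kept: q2 and q3 each have a best hit in the database, but that hit does not point back to them, so a single best-hit approach would annotate them while the bidirectional criterion would not.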

Methods and Results

Before annotating putative proteins against Swiss-Prot, TIGRFAMs, Pfam, and the NR database, SFannotation removes from each database all entries whose descriptions contain terms such as "unknown," "hypothetical," "unclassified," "uncharacterized," "putative," "predicted," and "conserved" (Fig. 1A), because including such entries may cause some putative proteins to be misannotated. Then, using BLASTP and HMMSEARCH, SFannotation searches each refined database for homologous proteins and domains using a default threshold (E-value ≤ 10^-5) and annotates each putative protein with its highest-scoring homolog, following a best-hit approach, either single best hit or bidirectional best hit [12, 16].
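
The two steps above, database filtration and single best-hit selection, can be sketched in Python as follows. This is an illustrative sketch, not the SFannotation implementation: the function names are hypothetical, case-insensitive substring matching of the terms is an assumption, and the hit lines are assumed to be in the standard 12-column BLAST tabular format (query ID in column 1, subject ID in column 2, E-value in column 11, bit score in column 12).

```python
import re

# Terms listed in the text; substring, case-insensitive matching is an assumption.
UNINFORMATIVE_TERMS = ("unknown", "hypothetical", "unclassified",
                       "uncharacterized", "putative", "predicted", "conserved")
_TERM_PATTERN = re.compile("|".join(UNINFORMATIVE_TERMS), re.IGNORECASE)


def filter_entries(entries):
    """Keep (description, sequence) pairs whose description has none of the terms."""
    return [(desc, seq) for desc, seq in entries if not _TERM_PATTERN.search(desc)]


def best_hits(tabular_lines, max_evalue=1e-5):
    """Return {query: subject} with the highest bit score per query.

    Hits with an E-value above max_evalue are ignored.
    """
    best = {}
    for line in tabular_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or comment lines
        query, subject = fields[0], fields[1]
        evalue, bits = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        if query not in best or bits > best[query][1]:
            best[query] = (subject, bits)
    return {query: subject for query, (subject, _) in best.items()}


# Toy data, for illustration only.
entries = [("P1 DNA polymerase I", "MKV"),
           ("P2 Uncharacterized protein", "MAA"),
           ("P3 Putative transporter", "MCC")]
hits = ["q1\ts1\t90.0\t100\t5\t0\t1\t100\t1\t100\t1e-30\t200",
        "q1\ts2\t95.0\t100\t2\t0\t1\t100\t1\t100\t1e-40\t250",
        "q2\ts3\t30.0\t50\t30\t2\t1\t50\t1\t50\t0.01\t30"]

print(filter_entries(entries))  # [('P1 DNA polymerase I', 'MKV')]
print(best_hits(hits))          # {'q1': 's2'}
```

In the toy data, q2 receives no annotation because its only hit has an E-value above the threshold; such a protein would remain unannotated after this database.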

Fig. 1

Database filtration (A) and workflow of the SFannotation annotation system (B). Black arrows represent putative proteins that are annotated by the best-hit approach, and red arrows represent unannotated proteins that are passed on as queries for homology searches against the other databases.

Putative proteins are annotated hierarchically using the following database priority, ordered by reliability: Swiss-Prot → TIGRFAMs → Pfam → NR (Fig. 1B). Once a putative protein is annotated, it is no longer queried against the remaining databases. For example, if a putative protein is annotated against Swiss-Prot, it is excluded from annotation against the other databases, while the remaining unannotated putative proteins continue to be searched against them. The runtime can therefore be reduced, because the number of unannotated putative proteins gradually decreases (Fig. 2).
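
The hierarchical workflow can be sketched in Python as a cascade in which each database sees only the proteins left unannotated by the previous ones. This is a minimal sketch under stated assumptions, not the SFannotation code: the function name is hypothetical, and each search function is a stand-in for a BLASTP or HMMSEARCH run that returns a best-hit annotation for the queries it can annotate.

```python
def annotate_hierarchically(queries, searchers):
    """Annotate queries against databases in priority order.

    searchers: ordered list of (db_name, search_fn) pairs, highest priority
    first. Each search_fn takes a list of query IDs and returns a dict
    {query_id: annotation} for the queries it could annotate (hypothetical
    stand-in for a real homology search).
    Returns (annotations, unannotated).
    """
    annotations = {}
    remaining = list(queries)
    for db_name, search in searchers:
        if not remaining:
            break  # nothing left to search, so later databases are skipped
        hits = search(remaining)
        for query in remaining:
            if query in hits:
                annotations[query] = (db_name, hits[query])
        # Only proteins still unannotated are passed to the next database.
        remaining = [query for query in remaining if query not in hits]
    return annotations, remaining


def make_stub(known):
    """Build a stub searcher that annotates only the IDs in `known`."""
    def search(ids):
        return {i: known[i] for i in ids if i in known}
    return search


# Toy annotations, for illustration only.
searchers = [("Swiss-Prot", make_stub({"p1": "kinase"})),
             ("TIGRFAMs", make_stub({"p1": "kinase family", "p2": "transporter"})),
             ("Pfam", make_stub({})),
             ("NR", make_stub({"p3": "ligase"}))]

annotations, unannotated = annotate_hierarchically(["p1", "p2", "p3", "p4"], searchers)
print(annotations)
print(unannotated)  # ['p4']
```

In the toy run, p1 is annotated by Swiss-Prot and never reaches TIGRFAMs, so its lower-priority TIGRFAMs annotation is not used, and the query list shrinks from four proteins to one across the four databases.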

Fig. 2

Runtime of the SFannotation system (red) and a best-hit approach without the hierarchical SFannotation workflow (black). Randomly selected proteins from Escherichia coli MG1655 (GenBank accession number: U00096) were tested on a 64-bit Linux system (Ubuntu) with 20 CPU threads.

Implementation

SFannotation is written in Perl and bash and runs on Linux/Unix systems on which BLASTP and HMMSEARCH can run. SFannotation automatically downloads all four databases, as well as BLASTP and HMMSEARCH, and then annotates the putative proteins. It is run from the command line on a Linux/Unix system: "perl SFannotation --download --fasta --speedup" (Supplementary Fig. 1).

16 in total

1. The use of gene clusters to infer functional coupling.

Authors: R Overbeek; M Fonstein; M D'Souza; G D Pusch; N Maltsev
Journal: Proc Natl Acad Sci U S A Date: 1999-03-16 Impact factor: 11.205

2. Bacterial genome annotation.

Authors: Nicholas Beckloff; Shawn Starkenburg; Tracey Freitas; Patrick Chain
Journal: Methods Mol Biol Date: 2012

3. OrthoMCL: identification of ortholog groups for eukaryotic genomes.

Authors: Li Li; Christian J Stoeckert; David S Roos
Journal: Genome Res Date: 2003-09 Impact factor: 9.043

4. The Pfam protein families database.

Authors: Marco Punta; Penny C Coggill; Ruth Y Eberhardt; Jaina Mistry; John Tate; Chris Boursnell; Ningze Pang; Kristoffer Forslund; Goran Ceric; Jody Clements; Andreas Heger; Liisa Holm; Erik L L Sonnhammer; Sean R Eddy; Alex Bateman; Robert D Finn
Journal: Nucleic Acids Res Date: 2011-11-29 Impact factor: 16.971

5. BLANNOTATOR: enhanced homology-based function prediction of bacterial proteins.

Authors: Matti Kankainen; Teija Ojala; Liisa Holm
Journal: BMC Bioinformatics Date: 2012-02-15 Impact factor: 3.169

6. InParanoid 7: new algorithms and tools for eukaryotic orthology analysis.

Authors: Gabriel Ostlund; Thomas Schmitt; Kristoffer Forslund; Tina Köstler; David N Messina; Sanjit Roopra; Oliver Frings; Erik L L Sonnhammer
Journal: Nucleic Acids Res Date: 2009-11-05 Impact factor: 16.971

7. A large-scale evaluation of computational protein function prediction.

Authors: Predrag Radivojac; Wyatt T Clark; Tal Ronnen Oron; Alexandra M Schnoes; Tobias Wittkop; Artem Sokolov; Kiley Graim; Christopher Funk; Karin Verspoor; Asa Ben-Hur; Gaurav Pandey; Jeffrey M Yunes; Ameet S Talwalkar; Susanna Repo; Michael L Souza; Damiano Piovesan; Rita Casadio; Zheng Wang; Jianlin Cheng; Hai Fang; Julian Gough; Patrik Koskinen; Petri Törönen; Jussi Nokso-Koivisto; Liisa Holm; Domenico Cozzetto; Daniel W A Buchan; Kevin Bryson; David T Jones; Bhakti Limaye; Harshal Inamdar; Avik Datta; Sunitha K Manjari; Rajendra Joshi; Meghana Chitale; Daisuke Kihara; Andreas M Lisewski; Serkan Erdin; Eric Venner; Olivier Lichtarge; Robert Rentzsch; Haixuan Yang; Alfonso E Romero; Prajwal Bhat; Alberto Paccanaro; Tobias Hamp; Rebecca Kaßner; Stefan Seemayer; Esmeralda Vicedo; Christian Schaefer; Dominik Achten; Florian Auer; Ariane Boehm; Tatjana Braun; Maximilian Hecht; Mark Heron; Peter Hönigschmid; Thomas A Hopf; Stefanie Kaufmann; Michael Kiening; Denis Krompass; Cedric Landerer; Yannick Mahlich; Manfred Roos; Jari Björne; Tapio Salakoski; Andrew Wong; Hagit Shatkay; Fanny Gatzmann; Ingolf Sommer; Mark N Wass; Michael J E Sternberg; Nives Škunca; Fran Supek; Matko Bošnjak; Panče Panov; Sašo Džeroski; Tomislav Šmuc; Yiannis A I Kourmpetis; Aalt D J van Dijk; Cajo J F ter Braak; Yuanpeng Zhou; Qingtian Gong; Xinran Dong; Weidong Tian; Marco Falda; Paolo Fontana; Enrico Lavezzo; Barbara Di Camillo; Stefano Toppo; Liang Lan; Nemanja Djuric; Yuhong Guo; Slobodan Vucetic; Amos Bairoch; Michal Linial; Patricia C Babbitt; Steven E Brenner; Christine Orengo; Burkhard Rost; Sean D Mooney; Iddo Friedberg
Journal: Nat Methods Date: 2013-01-27 Impact factor: 28.547

8. The SWISS-MODEL Repository and associated resources.

Authors: Florian Kiefer; Konstantin Arnold; Michael Künzli; Lorenza Bordoli; Torsten Schwede
Journal: Nucleic Acids Res Date: 2008-10-18 Impact factor: 16.971

9. NCBI BLAST: a better web interface.

Authors: Mark Johnson; Irena Zaretskaya; Yan Raytselis; Yuri Merezhuk; Scott McGinnis; Thomas L Madden
Journal: Nucleic Acids Res Date: 2008-04-24 Impact factor: 16.971

10. The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST).

Authors: Ross Overbeek; Robert Olson; Gordon D Pusch; Gary J Olsen; James J Davis; Terry Disz; Robert A Edwards; Svetlana Gerdes; Bruce Parrello; Maulik Shukla; Veronika Vonstein; Alice R Wattam; Fangfang Xia; Rick Stevens
Journal: Nucleic Acids Res Date: 2013-11-29 Impact factor: 16.971