ANNIE: integrated de novo protein sequence annotation.

Hong Sain Ooi, Chia Yee Kwo, Michael Wildpaner, Fernanda L Sirota, Birgit Eisenhaber, Sebastian Maurer-Stroh, Wing Cheong Wong, Alexander Schleiffer, Frank Eisenhaber, Georg Schneider.

Abstract

Function prediction of proteins with computational sequence analysis requires the use of dozens of prediction tools with a bewildering range of input and output formats. Each of these tools focuses on a narrow aspect and researchers are having difficulty obtaining an integrated picture. ANNIE is the result of years of close interaction between computational biologists and computer scientists and automates an essential part of this sequence analytic process. It brings together over 20 function prediction algorithms that have proven sufficiently reliable and indispensable in daily sequence analytic work and are meant to give scientists a quick overview of possible functional assignments of sequence segments in the query proteins. The results are displayed in an integrated manner using an innovative AJAX-based sequence viewer. ANNIE is available online at: http://annie.bii.a-star.edu.sg. This website is free and open to all users and there is no login requirement.

Year: 2009 PMID: 19389726 PMCID: PMC2703921 DOI: 10.1093/nar/gkp254

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Advances in sequencing technology have taken the number of available sequences in databases to unprecedented levels (1). Unfortunately, the ability to determine the sequence of a particular gene has not been accompanied by an equally impressive gain in our ability to achieve insights into the biological function (including molecular and cellular) of these sequences. For example, the full genome sequence of the yeast Saccharomyces cerevisiae became available in 1997 (2); nevertheless, more than a decade later, of the 6000+ identified genes there are still over 1000 with uncharacterized function (3). In human, more than half of the genes are functionally characterized incompletely or not at all. The classic route to functional characterization, involving experimental methods from the genetic and biochemical toolbox (such as specific knockouts, targeted mutations and a battery of biochemical assays), is time consuming (depending on the model organism, it can take years) and costly. Therefore, there is a strong case for using in silico methods in a preliminary analysis for functional hypothesis generation to direct experimental planning in the laboratory. There are literally hundreds of prediction algorithms described in the literature, although only some of those have sufficient sensitivity and selectivity to be applicable for unsupervised function prediction of arbitrary query protein sequences (4). Each method concentrates on some specific structural or functional aspect of a sequence, e.g. the distribution of unstructured regions (5), its amino acid compositional particularities in sequence windows (6) or the existence of globular domains (7,8). The input formats, method of program invocation as well as the result presentation vary widely, making it difficult to interconnect results and obtain an integrated picture of a possible functional assignment.
Even when concentrating on a smaller set of reliable prediction methods, the results can still easily exceed several Megabytes of textual (ASCII-type) information, integration of which into an overall functional prediction can be a formidable task requiring days of work per sequence. The need for standardizing automated annotations as well as assessing their quality has been recognized by initiatives such as AFP (9). There have been several attempts to address the interoperability problem (10–12). JAFA (13) is an example of an annotation meta-server that sends a query sequence to several function prediction servers and displays the overlap in Gene Ontology terms (14) as well as providing links to the original results. The ProFunc server (15) combines a range of methods for sequence analysis but requires the 3D structure of the query to be known in advance. There are also a number of databases that provide sequence annotations from various sources like UniProtKB (16) or Ensembl (17) as well as some services that predict a limited set of features for a given input sequence such as SMART (8), InterProScan (18,19) or TarO (20). It should be noted that, frequently, database annotations contain errors and, especially, function descriptions propagated by sequence similarity criteria might be dubious. Therefore, tools for de novo sequence annotation are important for reducing the dependence on potentially misleading or incomplete database comments (21,22). ANNIE is unique in that it has been developed by a collaboration of sequence analysis as well as computer science experts. It provides over 20 of the most useful algorithms (Table 1) covering the first two steps of segment-based sequence analysis (23) that have proven indispensable in daily sequence analytic work for functional discovery (24–26). Of particular value is the inclusion of predictors for a number of post-translational modifications (27–32) as well as targeting signals (33,34) developed in-house.

Table 1.

Sequence analytic algorithms

Algorithm	Description	Parameters
CAST (37)	Algorithm for low-complexity region (LCR) detection and selective masking	Threshold: 40
IUPred (5)	Prediction method for recognizing ordered and intrinsically unstructured/disordered regions in proteins	Prediction type: long disorder
SAPS (6)	Statistical analysis of protein sequences with respect to amino acid composition and simple sequence motifs	n/a
SEG (38)	Prediction of low complexity regions	Three parameter sets: (1) Window-size 12, Locut 2.2, Hicut 2.5; (2) Window-size 25, Locut 3.0, Hicut 3.3; (3) Window-size 45, Locut 3.4, Hicut 3.75
Big-∏ (27–29)	Prediction of protein GPI lipid anchor cleavage sites	Taxon-specific learning set
NMT (30,31)	Prediction of N-terminal N-myristoylation of proteins	Taxon-specific parameter set
PrePS – FT (32)	Farnesylation prediction	n/a
PrePS – GGT1 (32)	Geranylgeranylation prediction	n/a
PrePS – GGT2 (32)	Rab geranylgeranylation prediction	n/a
PeroPS/PTS1 (33,34)	Prediction of peroxisomal targeting signal 1	Taxon-specific prediction function
DAS-TMfilter (39)	Prediction of transmembrane regions	Quality cutoff: 0.72
HMMTOP (40)	Transmembrane topology prediction using Hidden Markov models	n/a
PHOBIUS (41)	Combined transmembrane topology and signal peptide predictor	n/a
TMHMM (42)	Transmembrane helix predictor	n/a
IMP-COIL (43)	Prediction of coiled-coil regions, modified implementation of the algorithm of Lupas et al. by F. Eisenhaber	n/a
PROSITE (44)	Pattern search in the PROSITE database	n/a
PROSITE-Profile (44)	Profile search in the PROSITE database	n/a
HMMER (7)	Profile Hidden Markov Models	SMART (8) with e-value cutoff of 0.001
IMPALA (45)	Tool to compare a query sequence against a library of position-specific scoring matrices	Wolf-library (e-value cutoff: 0.001) (46), Aravind-library (e-value cutoff: 1e-5) (47)
RPS-BLAST against CDD (48)	Reverse-position-specific BLAST against the Conserved Domain Database (CDD)	e-value cutoff: 0.001

The results of all algorithms are displayed in an integrated manner using a newly developed interactive sequence viewer as well as a number of views highlighting the distribution of features across sets of sequences. ANNIE enables scientists to gain a quick overview of possible functional assignments in protein sequence sets.

METHODS

Algorithms

Segment-based sequence analysis (23) starts with the assumption that proteins are chains of functional units which can be analyzed independently. The overall function arises from the synthesis of the functions predicted for each individual module. The procedure first uses algorithms for the detection of nonglobular regions, which are segments with a compositional bias or repetitive patterns that often represent linker regions, fibrillar segments, flexible binding sites or points of post-translational modifications (35). The subsequent step is to run algorithms for the identification of known globular domains. These domains are conserved within groups of homologous proteins and are often associated with enzymatic or ligand-binding function. In the last step, it is assumed that the remaining parts of the sequence represent yet uncharacterized globular domains that need to be characterized within the homologous family concept. Iterative heuristics have to be applied to uncover weak links in sequence space and collect a family of protein sequence segments that contain yet unknown globular domains (36). ANNIE provides a selection of algorithms covering the first two steps of this approach. Table 1 lists the algorithms which have been integrated together with a short description, references and the preselected runtime parameters. These parameters have been chosen so as to provide a reasonable compromise between the need to give a comprehensive and sensitive overview of sometimes weak signals and the ability of scientists not trained in sequence analysis to discard false positives. It should be noted that further relaxed parameterization might produce more prediction results; yet, their interpretation might require expert knowledge and experience. ANNIE is based on our extensive in-house sequence analytic pipeline ANNOTATOR, which is used to analyze proteomes and detect distant evolutionary relationships using computationally intensive iterative heuristics (36).
The engine behind ANNIE has been in use for several years and has annotated millions of sequences. The online help pages contain a detailed description of each individual algorithm.
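
The bookkeeping behind the three-step procedure can be illustrated with a minimal Python sketch. It is not ANNIE's or ANNOTATOR's code: the function names, the `min_len` parameter and the coordinates are assumptions made for illustration only. Given the segments reported in steps one and two, it lists the stretches left uncovered, which the third step would treat as candidate uncharacterized globular domains.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # 1-based, inclusive (start, end) residue positions


def merge_segments(segments: List[Segment]) -> List[Segment]:
    """Merge overlapping or directly adjacent segments into a sorted list."""
    merged: List[Segment] = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def uncovered_regions(seq_len: int, segments: List[Segment],
                      min_len: int = 1) -> List[Segment]:
    """Return stretches of the sequence not covered by any annotated segment."""
    gaps: List[Segment] = []
    position = 1  # first residue not yet accounted for
    for start, end in merge_segments(segments):
        if start - position >= min_len:
            gaps.append((position, start - 1))
        position = max(position, end + 1)
    if seq_len - position + 1 >= min_len:
        gaps.append((position, seq_len))
    return gaps


# Hypothetical annotations for a 500-residue protein (illustrative values only)
nonglobular = [(1, 30), (200, 260)]      # step 1: e.g. low-complexity segments
known_domains = [(40, 180), (250, 400)]  # step 2: e.g. domain model hits
candidates = uncovered_regions(500, nonglobular + known_domains, min_len=30)
print(candidates)
```

The `min_len` filter reflects the assumption that very short gaps between annotated segments are unlikely to hold a complete globular domain; the value 30 is arbitrary.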

User-interface

There are two input methods allowing the user to either paste sequences in FASTA-format (a single sequence can also be pasted without a description line) or upload them from a corresponding FASTA-formatted file. There is currently a limit of 10 sequences per annotation run which might be increased in the future depending on actual usage patterns and the availability of compute server resources. It is highly recommended to include taxonomic information in the classical NCBI square bracket notation at the end of the description line (e.g. [Homo sapiens]) in order for ANNIE to automatically choose the correct parameterization for predictors of post-translational modifications and targeting signals. Additionally, this will enable the user to view the taxonomic distribution of the uploaded sequence set. The annotation process is started by pressing the corresponding ‘Annotate’ button. Requests are queued and, upon availability of resources, sent to a cluster of dedicated CPUs for execution of algorithms and parsing of output. The user will be directed to a page containing the current as well as past results. If an (optional) email address is provided, a message containing a link will be sent once all algorithms have completed. This gives the user access to past annotations for at least 72 h, after which they will be deleted. There are a number of views that allow the user to look at different aspects of the annotation. Upon submission of an annotation request the user will normally click on the corresponding result folder and be presented with a view displaying the uploaded sequences with links to individual results. If a certain algorithm is still queued or running a special symbol will be displayed and the page reloads periodically until all algorithms have terminated (under average load this should take no more than 1 min).
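
The input conventions described above can be checked before submission with a short Python sketch. It is an illustration under stated assumptions, not ANNIE's own parser: the `Record` type, the `parse_fasta` function and the regular expression are hypothetical, and the example sequences are made up. It accepts FASTA text or a single bare sequence, enforces the limit of 10 sequences, and extracts a trailing taxon in square brackets from each description line.

```python
import re
from typing import List, NamedTuple, Optional

MAX_SEQUENCES = 10  # per-run limit stated in the text
# Taxon in square brackets at the very end of the description line
TAXON_PATTERN = re.compile(r"\[([^\[\]]+)\]\s*$")


class Record(NamedTuple):
    description: str
    taxon: Optional[str]
    sequence: str


def parse_fasta(text: str) -> List[Record]:
    """Parse FASTA text; a single bare sequence without '>' line is accepted."""
    entries: List[List[str]] = []  # each entry: [description, part, part, ...]
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            entries.append([line[1:].strip()])
        elif entries:
            entries[-1].append(line)
        else:
            entries.append(["", line])  # bare sequence, empty description
    if len(entries) > MAX_SEQUENCES:
        raise ValueError(f"at most {MAX_SEQUENCES} sequences per run")
    records = []
    for description, *parts in entries:
        match = TAXON_PATTERN.search(description)
        taxon = match.group(1) if match else None
        records.append(Record(description, taxon, "".join(parts)))
    return records


# Made-up example input (not a real protein)
example = ">p1 demo protein [Homo sapiens]\nMKTAY\nIAKQR\n>p2 no taxon given\nGGSGG\n"
for record in parse_fasta(example):
    print(record.taxon, len(record.sequence))
```

Records whose `taxon` is `None` are the ones for which a taxon-specific predictor could not pick its parameter set automatically.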

Result view

Following the links for individual algorithms will display the corresponding result together with links to external resources where applicable (e.g. domain descriptions for HMMER). Each result also provides access for validation purposes to the ‘raw’ unparsed data generated by the executable.

Interactive sequence view

Clicking on the protein sequence symbol starts the interactive sequence view (Figure 1). The results of individual algorithms are displayed as rectangles projected onto the sequence ruler. Hovering over regions will display information specific to the result (e.g. e-values of globular domain model hits). Right-clicking on a region will allow examination of the particular feature in greater detail with algorithm-specific information as well as a compositional analysis of the sequence stretch.

Figure 1.

Interactive sequence view. This figure shows an example interactive sequence view using the sequence of Dysferlin. The sequence features found by the various programs are organized in panes that coalesce findings with similar functional significance. The different color coding is just for the purpose of easing navigation.

Figure 1 displays the interactive sequence view of Dysferlin (49,50), a protein involved in a number of hereditary myopathies (it is provided as a sample sequence on the main page). The characteristic C2-domains (51) have been detected by a number of distinct tools (HMMER against SMART, IMPALA against Wolf-library, PROSITE-Profile search, RPS-BLAST against CDD), giving enhanced confidence to that particular finding. The detection of a C-terminal membrane-embedded region by three different methods also lends plausibility to the claim that Dysferlin is a transmembrane protein. It should be noted that there is a seventh C2-domain not shown in this view between residues 1338 and 1437 (the e-value = 0.025 is above the default threshold of 0.001). Due to the AJAX-based technology of the viewer, zooming and panning is almost instantaneous, allowing fast and concise drill-down to a particular region. Additional feature-specific information can be obtained by right-clicking on a region. This will lead to a detailed compositional analysis of the sequence stretch and, where applicable, include alignment data as well as links to external resources.
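
The two ideas in this example, an e-value cutoff that hides weak hits and agreement between independent tools as a source of confidence, can be sketched in Python. This is a hedged illustration rather than the viewer's logic: the `Hit` type and both functions are hypothetical, and all hits except the coordinates and e-value of the hidden seventh C2-domain are invented values.

```python
from typing import List, NamedTuple, Tuple


class Hit(NamedTuple):
    tool: str
    feature: str
    start: int
    end: int
    evalue: float


def split_by_cutoff(hits: List[Hit],
                    cutoff: float = 0.001) -> Tuple[List[Hit], List[Hit]]:
    """Separate hits at or below the e-value cutoff from those above it."""
    shown = [h for h in hits if h.evalue <= cutoff]
    hidden = [h for h in hits if h.evalue > cutoff]
    return shown, hidden


def supporting_tools(hits: List[Hit], start: int, end: int) -> List[str]:
    """Distinct tools reporting a hit that overlaps the region [start, end]."""
    return sorted({h.tool for h in hits if h.start <= end and h.end >= start})


# Invented hits, except the last one, which uses the residues and e-value
# quoted in the text (the reporting tool is an assumption)
hits = [
    Hit("HMMER", "C2", 5, 100, 1e-20),
    Hit("IMPALA", "C2", 3, 105, 1e-12),
    Hit("RPS-BLAST", "C2", 8, 98, 1e-15),
    Hit("HMMER", "C2", 1338, 1437, 0.025),
]
shown, hidden = split_by_cutoff(hits)
print(len(shown), len(hidden))
print(supporting_tools(shown, 1, 110))
```

Relaxing `cutoff` to 0.05 would move the weak hit into the displayed set, at the cost of more potential false positives elsewhere.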

Set view

Uploading several sequences at once opens up the possibility to analyze the frequency of certain features within that set of sequences. ANNIE provides a special view called ‘Histogram’ (Figure 2). This view displays features found with diverse algorithms sorted by the number of occurrences. Clicking on the name of the feature will link to all the sequences in which it has been detected.
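
A set-level count of this kind can be sketched in a few lines of Python. The sketch assumes that each sequence has already been reduced to a set of feature names; the function names, sequence identifiers and feature assignments below are hypothetical and are not ANNIE output.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def feature_histogram(annotations: Dict[str, Set[str]]) -> List[Tuple[str, int]]:
    """Count in how many sequences each feature occurs, most frequent first."""
    counts = Counter(f for features in annotations.values() for f in features)
    # Sort by descending count, then by name for a stable order among ties
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def sequences_with(annotations: Dict[str, Set[str]], feature: str) -> List[str]:
    """Sequences in which the given feature was detected (the 'sublist')."""
    return sorted(seq for seq, features in annotations.items() if feature in features)


# Hypothetical sequence set with invented feature assignments
annotations = {
    "seq1": {"SEG low complexity", "KOG3014"},
    "seq2": {"SEG low complexity", "KOG3014"},
    "seq3": {"SEG low complexity"},
}
for name, count in feature_histogram(annotations):
    print(f"{count}\t{name}")
print(sequences_with(annotations, "KOG3014"))
```

Because each sequence contributes a set, a feature found several times in one sequence is counted once for that sequence; counting every occurrence instead would require lists rather than sets.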

Figure 2.

Histogram view. This view shows the occurrence of sequence features in the sequence set under investigation. The features are sorted by their number of incidences in the set. Clicking on the link provided with the feature name will generate the sublist of sequences with this feature. In this example of Eco1-type proteins, the top four entries in the histogram are related to low-complexity regions as well as short motifs from PROSITE that are less reliable predictions. The fifth entry indicates the occurrence of the KOG3014 domain model that is characteristic for the Eco1-class of proteins necessary for the establishment of sister chromatid cohesion in mitosis.

Figure 3.

Taxonomy view. The taxonomic distribution of the sequence set is displayed. The numbers in brackets refer to the number of sequences below a branch in the taxonomic tree and those assigned to a particular taxon. For the given Eco1 example set, this view shows that it contains one plant sequence (Arabidopsis thaliana) together with a trypanosome, one fungal sequence and four from Bilateria.

CONCLUSIONS AND OUTLOOK

We have presented ANNIE, a comprehensive de novo protein annotation system that integrates a large number of indispensable algorithms used in everyday sequence analytic work. The results of individual algorithms can be accessed separately or displayed together in an interactive AJAX-based sequence viewer. There are additional views for assessing the frequency of certain features across a set of sequences as well as revealing its taxonomic distribution. New algorithms appearing in the literature are constantly being evaluated as to their potential contribution for function discovery and are eventually integrated. Future work will also see the inclusion of algorithms from the third step of segment-based sequence analysis if the necessary computational resources can be obtained.

FUNDING

Bioinformatics Institute, A*Star, Singapore; Boehringer Ingelheim (2001-2007); and the Austrian Gen-AU BIN program (2004-2007) when the Eisenhaber group was located at the Research Institute of Molecular Pathology in Vienna (Austria). Funding for open access charge: Bioinformatics Institute, A*Star, Singapore. Conflict of interest statement. None declared.

50 in total

Review 1. Posttranslational modifications and subcellular localization signals: indicators of sequence regions without inherent 3D structure?

Authors: Birgit Eisenhaber; Frank Eisenhaber
Journal: Curr Protein Pept Sci Date: 2007-04 Impact factor: 3.272

2. Prediction of potential GPI-modification sites in proprotein sequences.

Authors: B Eisenhaber; P Bork; F Eisenhaber
Journal: J Mol Biol Date: 1999-09-24 Impact factor: 5.469

3. Dysferlin is a surface membrane-associated protein that is absent in Miyoshi myopathy.

Authors: C Matsuda; M Aoki; Y K Hayashi; M F Ho; K Arahata; R H Brown
Journal: Neurology Date: 1999-09-22 Impact factor: 9.910

4. The pairwise energy content estimated from amino acid composition discriminates between folded and intrinsically unstructured proteins.

Authors: Zsuzsanna Dosztányi; Veronika Csizmók; Péter Tompa; István Simon
Journal: J Mol Biol Date: 2005-04-08 Impact factor: 5.469

Review 5. Why are there still over 1000 uncharacterized yeast genes?

Authors: Lourdes Peña-Castillo; Timothy R Hughes
Journal: Genetics Date: 2007-04-15 Impact factor: 4.562

6. Refinement and prediction of protein prenylation motifs.

Authors: Sebastian Maurer-Stroh; Frank Eisenhaber
Journal: Genome Biol Date: 2005-05-27 Impact factor: 13.583

7. ProFunc: a server for predicting protein function from 3D structure.

Authors: Roman A Laskowski; James D Watson; Janet M Thornton
Journal: Nucleic Acids Res Date: 2005-07-01 Impact factor: 16.971

8. JAFA: a protein function annotation meta-server.

Authors: Iddo Friedberg; Tim Harder; Adam Godzik
Journal: Nucleic Acids Res Date: 2006-07-01 Impact factor: 16.971

9. Taverna: a tool for building and running workflows of services.

Authors: Duncan Hull; Katy Wolstencroft; Robert Stevens; Carole Goble; Mathew R Pocock; Peter Li; Tom Oinn
Journal: Nucleic Acids Res Date: 2006-07-01 Impact factor: 16.971

10. Application of a sensitive collection heuristic for very large protein families: evolutionary relationship between adipose triglyceride lipase (ATGL) and classic mammalian lipases.

Authors: Georg Schneider; Georg Neuberger; Michael Wildpaner; Sun Tian; Igor Berezovsky; Frank Eisenhaber
Journal: BMC Bioinformatics Date: 2006-03-21 Impact factor: 3.169
