Integrating protein annotation resources through the Distributed Annotation System.

Abstract

Using the Distributed Annotation System (DAS) we have created a protein annotation resource available at our web page: http://www.cbs.dtu.dk, as a part of the BioSapiens Network of Excellence EU FP6 project. The DAS protocol allows us to gather layers of annotation data for a given sequence and thereby gain an overview of the sequence's features. A user-friendly graphical client has also been developed (http://www.cbs.dtu.dk/cgi-bin/das), which demonstrates the possibility of integrating DAS annotation data from multiple sources into a simple graphical view. The client displays protein feature annotations from the Center for Biological Sequence Analysis as well as from the BioSapiens reference UniProt server (http://www.ebi.ac.uk/das-srv/uniprot/das) at the European Bioinformatics Institute. Other DAS data sources for protein annotation will be added as they become available.

Year: 2005 PMID: 15980514 PMCID: PMC1160224 DOI: 10.1093/nar/gki463

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

In recent years, numerous computational tools for gene and protein analysis have been constructed by various laboratories. Several such tools have been created and published by the Center for Biological Sequence Analysis (CBS), many of which are available online to all users at the CBS web page: http://www.cbs.dtu.dk. The results produced by such tools have led to an explosion in the amount of data in biological databases and in the information available for biological sequences. Today, one of the major tasks of systems biology is to integrate as much of the experimental and computational information as possible and thereby gain biological insight into the properties and function of the macromolecules under observation. This requires integrating several types of data, in various formats and dispersed around the globe, into a unified structure.

Integrating online annotations is greatly simplified if the annotation services follow accepted standards. One such standard is the Distributed Annotation System (DAS) (1). DAS services have existed for several years: version 1.0 of the DAS specification was released in 2001 and version 2 is under development. The DAS protocol is a simple HTTP-based client-server system. A query in the form of a URL is made to the server, which replies with annotations for the sequence entry specified in the URL query. The reply from the server is in XML format. The DAS web page offers both Perl- and Java-based server software for download; client libraries in Perl and Java are also available. The DAS specification was originally written with genomic sequences in mind, but the standard has proven flexible enough to handle protein data as well. Several annotation databases now serve annotations using DAS, including Ensembl (2), FlyBase (3), UniProt (4) and WormBase (5).

The flexibility and success of the DAS protocol have made it the annotation method of choice for the BioSapiens Network of Excellence, of which the CBS DAS server detailed here is a part. In the near future, the consortium members will deploy several DAS servers that serve protein annotations for the same UniProt sequences as the DAS server at CBS, so that all the data can be easily integrated in a coherent manner. The full list of query types supported by the DAS specification is beyond the scope of this document; we refer readers to the DAS web page and specification for detailed information. For protein sequences, the most important queries are probably ‘sequence’, to which a reference DAS server responds with the full sequence, and ‘features’, to which reference and annotation servers respond with the feature annotations they store for a specified sequence identifier. An example query to the CBS DAS server is shown below.

SERVER INFRASTRUCTURE

At CBS, we have implemented a Perl-based DAS server, ProServer, which accepts queries at the address: . We provide annotations from several of CBS's protein sequence annotation servers, which predict protein sorting [LipoP (6), NetNES (7), SignalP (8), SecretomeP (9), TargetP (10)], protein post-translational modification [NetAcet (11), NetPhos (12), NetOGlyc (13), NetNGlyc, ProP (14)] and protein structure and function [TMHMM (15)]. Statistics and data source names (DSNs) for the individual methods are shown in Table 1. The annotations provided by the DAS server include the start and end positions of the annotated feature; the score from the prediction method that assigned the feature; a hyperlink to the web page of the prediction method, with the sequence information preloaded in the input form; and, possibly, some further information.
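
The annotation fields listed above (start, end, score and hyperlink) can be pulled out of a DAS ‘features’ reply with a few lines of Python. The sketch below is illustrative only: the element names follow the DASGFF layout of the DAS 1.x specification as we understand it, while the sample reply, its values, the example.org addresses and the function name parse_features are assumptions made for the example, not output from the CBS server.

```python
import xml.etree.ElementTree as ET

# Illustrative DASGFF-style reply. All values and URLs are made up for
# this example; they are not real output from any DAS server.
SAMPLE_REPLY = """<DASGFF>
  <GFF version="1.0" href="http://das.example.org/das/signalp/features?segment=EGFR_HUMAN">
    <SEGMENT id="EGFR_HUMAN" start="1" stop="1210">
      <FEATURE id="signalp_1" label="signal peptide">
        <TYPE id="SIGNAL">signal peptide</TYPE>
        <METHOD id="signalp">SignalP</METHOD>
        <START>1</START>
        <END>24</END>
        <SCORE>0.99</SCORE>
        <LINK href="http://das.example.org/signalp?id=EGFR_HUMAN">SignalP prediction</LINK>
      </FEATURE>
    </SEGMENT>
  </GFF>
</DASGFF>"""


def parse_features(xml_text):
    """Return one dict per FEATURE element found in a DASGFF-style reply."""
    root = ET.fromstring(xml_text)
    features = []
    for segment in root.iter("SEGMENT"):
        for feat in segment.iter("FEATURE"):
            link = feat.find("LINK")
            features.append({
                "segment": segment.get("id"),
                "type": feat.findtext("TYPE"),
                "method": feat.findtext("METHOD"),
                "start": int(feat.findtext("START")),
                "end": int(feat.findtext("END")),
                # Kept as text because a server may report a non-numeric score.
                "score": feat.findtext("SCORE"),
                "link": link.get("href") if link is not None else None,
            })
    return features


feats = parse_features(SAMPLE_REPLY)
for f in feats:
    print(f["method"], f["start"], f["end"], f["score"], f["link"])
```

Keeping the score as text is a deliberate choice in this sketch, so that a reply without a numeric score does not break the parser.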

Table 1

Annotation methods provided by the CBS DAS system

Method	Data source name	Organism coverage	Number of records	Reference
LipoP-1.0	lipop	Gneg	7 597	(6)
NetAcet-1.0	netacet	E	122 664	(11)
NetNES-1.1	netnes	E	1 945 054	(7)
NetNGlyc-1.0	netnglyc	H	137 800	
NetOGlyc-3.1	netoglyc	M	81 310	(13)
NetPhos-2.0	netphos	E	8 940 654	(12)
ProP-1.0	prop	E	127 553	(14)
SecretomeP-1.0	secretomep	E	58 318	(9)
SignalP-3.0	signalp	E, Gpos, Gneg	1 189 706	(8)
TargetP-1.01	targetp	E	750 111	(10)
TMHMM-2.0	tmhmm	A	5 086 476	(15)
All the above combined	cbs_total		18 447 243	

The annotation methods are specific to the following phylogenetic groups: ‘A’ stands for all proteins, ‘E’ for eukaryotes, ‘Gpos’ for Gram-positive bacteria, ‘Gneg’ for Gram-negative bacteria, ‘H’ for human and ‘M’ for mammals. The data source name is the name of the particular annotation method on the DAS server.

In general, the annotations span all of UniProt (4), but are limited to phylogenetic subsets of the database, as the annotation methods are usually constructed with a specific phylogenetic group as a target (see the reference for each server for details). Currently, the CBS DAS servers provide over 18 million protein annotations for over 1.5 million protein sequences from the UniProt database, and we hope that this wide coverage makes our services of general interest to the scientific community. The predicted annotations include several highly cited methods, e.g. SignalP and NetPhos, which are among the top 1% of the most cited papers in the scientific literature according to the Institute for Scientific Information. The annotations are precalculated and the results stored in a relational database, allowing for fast retrieval and update of data.

Regarding the terminology of the predicted features, we have generally used the nomenclature of the original prediction method. In some cases, we have modified the feature names to mimic the UniProt feature table, thus reflecting the structure of the reference database and allowing for easy comparison between the reference UniProt server and other annotation resources. It is quite conceivable that the vocabulary will be updated at a later point to make use of standard ontologies such as the Gene Ontology (GO) (16), so that post-translational modifications would be mapped onto GO ‘biological process’, etc. The concept of the Sequence Ontology (SO) is highly relevant to this project; however, the SO does not yet provide sufficient coverage of protein sequence attributes, such as post-translational modifications, to be useful for our purposes.
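
To illustrate why storing precalculated annotations in a relational database makes retrieval fast, the following sketch uses Python's built-in sqlite3 module. The table layout, the column and index names, the lookup helper and the sample rows are all hypothetical; the actual CBS schema and database engine are not described in this paper.

```python
import sqlite3

# Hypothetical schema: one row per precalculated feature annotation.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE feature (
        sequence_id TEXT NOT NULL,     -- UniProt identifier or accession
        dsn         TEXT NOT NULL,     -- data source name, e.g. 'signalp'
        start_pos   INTEGER NOT NULL,
        end_pos     INTEGER NOT NULL,
        score       REAL,
        label       TEXT
    )
""")
# An index on (dsn, sequence_id) lets a 'features' request be answered
# without scanning the whole table.
conn.execute("CREATE INDEX idx_feature_lookup ON feature (dsn, sequence_id)")

# Made-up example rows, not real predictions.
rows = [
    ("EGFR_HUMAN", "signalp", 1, 24, 0.99, "signal peptide"),
    ("EGFR_HUMAN", "tmhmm", 646, 668, None, "transmembrane helix"),
    ("EXAMPLE_PROT", "netphos", 15, 15, 0.87, "phosphoserine"),
]
conn.executemany("INSERT INTO feature VALUES (?, ?, ?, ?, ?, ?)", rows)


def lookup(conn, dsn, sequence_id):
    """Fetch the stored features for one data source and one sequence."""
    cur = conn.execute(
        "SELECT start_pos, end_pos, score, label FROM feature "
        "WHERE dsn = ? AND sequence_id = ? ORDER BY start_pos",
        (dsn, sequence_id),
    )
    return cur.fetchall()


print(lookup(conn, "signalp", "EGFR_HUMAN"))
```

Updating a method then amounts to replacing the rows for one data source name, which is one reason a precalculated store is convenient.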

A query example

When querying a DAS server for annotations, one must append the DSN, a query type and a sequence identifier to the address of the server. For example, to request annotations from the SignalP signal peptide prediction method (8) for the protein EGFR_HUMAN, we first append the DSN for that method (‘signalp’, Table 1). We then use the ‘features’ query to ask for feature annotations and identify the sequence as a ‘segment’. The whole query string thus looks like this: .
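
The steps described above can be sketched in Python as follows. The helper name das_query_url and the server address http://das.example.org/das are placeholders introduced for illustration; the address of the CBS DAS server is not reproduced here. The general shape, server address followed by the DSN, the query type and a segment argument, follows the DAS 1.x convention as we understand it.

```python
from urllib.parse import urlencode


def das_query_url(server, dsn, command, segment):
    """Build a DAS 1.x style query: <server>/<dsn>/<command>?segment=<id>."""
    query = urlencode({"segment": segment})
    return f"{server.rstrip('/')}/{dsn}/{command}?{query}"


# Placeholder server address, not the real CBS DAS address.
SERVER = "http://das.example.org/das"

url = das_query_url(SERVER, "signalp", "features", "EGFR_HUMAN")
print(url)
```

Swapping the DSN for another entry from Table 1, such as ‘netphos’ or ‘cbs_total’, would address a different annotation method on the same server.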

CBS DAS VIEWER

As the raw XML output of DAS servers is not well suited to browsing feature annotations, we have developed a client viewer that visualizes CBS DAS annotations in a simple graphical way. The viewer is publicly available at http://www.cbs.dtu.dk/cgi-bin/das. All the user needs to do is enter a UniProt accession number or identifier. The viewer then collects the annotations provided by the CBS DAS servers for that sequence, along with annotations from the UniProt reference DAS server at the European Bioinformatics Institute (http://www.ebi.ac.uk/das-srv/uniprot/das). All the annotations are then displayed as aligned graphical tracks, allowing easy inspection of features along the length of the protein. Additional information about an annotation is shown in a pop-up window when the user points the mouse at an annotation track. This is the first time CBS has provided a composite graphical display of several of its protein prediction methods simultaneously, which users of the CBS prediction services may find interesting. Some types of feature annotation carry a hyperlink in the XML payload; when the user clicks on the graphical track for such an annotation, the CBS DAS protein viewer follows the hyperlink in a new browser window. The graphical tracks can also be folded and expanded to give a simplified overview. A screenshot of the client in action can be seen in Figure 1.

The client demonstrates how easily different data sources can be integrated using DAS. We plan to incorporate relevant DAS protein annotation resources into the graphical client as they appear. At the time of writing, only one external DAS source was incorporated in the view: a resource provided by the Sanger Institute in which RCSB Protein Data Bank (17) structures are aligned to UniProt entries.
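
The idea of showing annotations from several sources as aligned tracks can be sketched in a few lines of Python. This is a plain-text illustration under our own assumptions, not the implementation of the CBS viewer: the function names build_tracks and render_track, the feature dictionaries and the coordinates are all made up for the example.

```python
from collections import defaultdict


def build_tracks(features):
    """Group features by source so that each source becomes one track."""
    tracks = defaultdict(list)
    for feat in features:
        tracks[feat["source"]].append(feat)
    for feats in tracks.values():
        feats.sort(key=lambda f: f["start"])
    return dict(tracks)


def render_track(feats, length, width=60):
    """Draw one track as text: '#' marks covered positions, '-' the rest."""
    cells = ["-"] * width
    for f in feats:
        # Scale 1-based sequence positions onto the fixed-width track.
        first = (f["start"] - 1) * width // length
        last = (f["end"] - 1) * width // length
        for i in range(first, last + 1):
            cells[i] = "#"
    return "".join(cells)


# Made-up features from three sources for a protein of 1210 residues.
features = [
    {"source": "signalp", "start": 1, "end": 24},
    {"source": "tmhmm", "start": 646, "end": 668},
    {"source": "uniprot", "start": 25, "end": 1210},
]

tracks = build_tracks(features)
for source, feats in tracks.items():
    print(f"{source:>10} {render_track(feats, length=1210)}")
```

Because every track is scaled to the same width, features from different servers line up by sequence position, which is what makes a composite view easy to read.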

Figure 1

The CBS protein DAS viewer. The browser interface is very simple: it has only one form field, and the graphical tracks show the annotations for a given UniProt protein. Additional information for individual features is shown in a pop-up help window when the mouse is pointed at the feature.

REFERENCES

17 in total

1. The distributed annotation system.

Authors: R D Dowell; R M Jokerst; A Day; S R Eddy; L Stein
Journal: BMC Bioinformatics Date: 2001-10-10

6. Prediction of lipoprotein signal peptides in Gram-negative bacteria.

Authors: Agnieszka S Juncker; Hanni Willenbrock; Gunnar von Heijne; Søren Brunak; Henrik Nielsen; Anders Krogh
Journal: Protein Sci Date: 2003-08

7. Analysis and prediction of leucine-rich nuclear export signals.

Authors: Tanja la Cour; Lars Kiemer; Anne Mølgaard; Ramneek Gupta; Karen Skriver; Søren Brunak
Journal: Protein Eng Des Sel Date: 2004-08-16

8. Improved prediction of signal peptides: SignalP 3.0.

Authors: Jannick Dyrløv Bendtsen; Henrik Nielsen; Gunnar von Heijne; Søren Brunak
Journal: J Mol Biol Date: 2004-07-16

9. Feature-based prediction of non-classical and leaderless protein secretion.

Authors: Jannick Dyrløv Bendtsen; Lars Juhl Jensen; Nikolaj Blom; Gunnar von Heijne; Søren Brunak
Journal: Protein Eng Des Sel Date: 2004-04-28

10. Predicting subcellular localization of proteins based on their N-terminal amino acid sequence.

Authors: O Emanuelsson; H Nielsen; S Brunak; G von Heijne
Journal: J Mol Biol Date: 2000-07-21

11. NetAcet: prediction of N-terminal acetylation sites.

Authors: Lars Kiemer; Jannick Dyrløv Bendtsen; Nikolaj Blom
Journal: Bioinformatics Date: 2004-11-11

12. Sequence and structure-based prediction of eukaryotic protein phosphorylation sites.

Authors: N Blom; S Gammeltoft; S Brunak
Journal: J Mol Biol Date: 1999-12-17

14. Prediction of proprotein convertase cleavage sites.

Authors: Peter Duckert; Søren Brunak; Nikolaj Blom
Journal: Protein Eng Des Sel Date: 2004-01

15. A hidden Markov model for predicting transmembrane helices in protein sequences.

Authors: E L Sonnhammer; G von Heijne; A Krogh
Journal: Proc Int Conf Intell Syst Mol Biol Date: 1998
