
Integrating protein annotation resources through the Distributed Annotation System.

Páll Isólfur Olason

Abstract

Using the Distributed Annotation System (DAS) we have created a protein annotation resource available at our web page: http://www.cbs.dtu.dk, as a part of the BioSapiens Network of Excellence EU FP6 project. The DAS protocol allows us to gather layers of annotation data for a given sequence and thereby gain an overview of the sequence's features. A user-friendly graphical client has also been developed (http://www.cbs.dtu.dk/cgi-bin/das), which demonstrates the possibility of integrating DAS annotation data from multiple sources into a simple graphical view. The client displays protein feature annotations from the Center for Biological Sequence Analysis as well as from the BioSapiens reference UniProt server (http://www.ebi.ac.uk/das-srv/uniprot/das) at the European Bioinformatics Institute. Other DAS data sources for protein annotation will be added as they become available.

Year:  2005        PMID: 15980514      PMCID: PMC1160224          DOI: 10.1093/nar/gki463

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

In recent years, various laboratories have constructed numerous computational tools for gene and protein analysis. Several such tools have been created and published by the Center for Biological Sequence Analysis (CBS), and many of them are available online to all users at the CBS web page: http://www.cbs.dtu.dk. The results of such tools have led to an explosion in the amount of data in biological databases and in the information available for biological sequences. Today, one of the major tasks of systems biology is to integrate as much of the experimental and computational information as possible and thereby gain biological insight into the properties and function of the macromolecules under observation. This means integrating several types of data, in various formats and dispersed around the globe, into a unified structure. Integrating online annotations is greatly simplified if the annotation services follow accepted standards.
One such standard is the Distributed Annotation System (DAS) (1). DAS services have existed for several years: version 1.0 of the DAS specification was released in 2001 and version 2 is under development. The DAS protocol is a simple HTTP-based client-server system. A query in the form of a URL is made to the server, which replies with annotations for the sequence entry specified in the URL query. The reply from the server is in XML format. The DAS web page offers both Perl- and Java-based server software for download, and client libraries in Perl and Java are also available. The DAS specification was originally written with genomic sequences in mind, but the standard has proven flexible enough to handle protein data as well. Several annotation databases now serve annotations through DAS, including Ensembl (2), FlyBase (3), UniProt (4) and WormBase (5).
The flexibility and success of the DAS protocol have made it the annotation method of choice for the BioSapiens Network of Excellence, of which the CBS DAS server detailed here is a part. In the near future, the consortium members will deploy several DAS servers that serve protein annotations for the same UniProt sequences as the DAS server at CBS, so all the data can be integrated easily and coherently. The full list of query types supported by the DAS specification is beyond the scope of this document; we refer readers to the DAS web page and specification for detailed information. For protein sequences, the most important queries are probably ‘sequence’, to which a reference DAS server responds with the full sequence, and ‘features’, to which reference and annotation servers respond with the feature annotations they store for a specified sequence identifier. An example query to the CBS DAS server is shown below.

SERVER INFRASTRUCTURE

At CBS, we have implemented a Perl-based DAS server, ProServer, which accepts queries at the address: . We provide annotations from several of CBS's protein sequence annotation servers, which predict protein sorting [LipoP (6), NetNES (7), SignalP (8), SecretomeP (9), TargetP (10)], protein post-translational modification [NetAcet (11), NetPhos (12), NetOGlyc (13), NetNGlyc, ProP (14)] and protein structure and function [TMHMM (15)]. Statistics and data source names (DSNs) for the individual methods are shown in Table 1. The annotations provided by the DAS server include: the start and end positions of the annotated feature; the score from the prediction method that assigned the feature; a hyperlink to the web page of the prediction method, with sequence information preloaded in the form input; and, in some cases, further information.
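As an illustration of the annotation fields listed above, the following Python sketch extracts the start, end, score and hyperlink of each feature from a DAS ‘features’ reply. The XML snippet is an invented example: its element names follow our reading of the DAS 1.x DASGFF format, while the host das.example.org, the feature identifier and all values are placeholders rather than real CBS output.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional

# Invented reply for illustration only; not real output from the CBS server.
SAMPLE_REPLY = """<DASGFF>
  <GFF version="1.0" href="http://das.example.org/das/signalp/features">
    <SEGMENT id="EGFR_HUMAN" start="1" stop="1210">
      <FEATURE id="example_feature_1" label="signal peptide">
        <TYPE id="SIGNAL">signal peptide</TYPE>
        <METHOD id="SignalP">SignalP</METHOD>
        <START>1</START>
        <END>24</END>
        <SCORE>0.99</SCORE>
        <LINK href="http://das.example.org/signalp?id=EGFR_HUMAN">details</LINK>
      </FEATURE>
    </SEGMENT>
  </GFF>
</DASGFF>
"""


@dataclass
class Feature:
    segment: str            # sequence identifier the feature belongs to
    label: str              # human-readable feature label
    start: int              # first residue of the feature
    end: int                # last residue of the feature
    score: Optional[float]  # prediction score, None if absent or "-"
    link: Optional[str]     # hyperlink carried in the reply, if any


def parse_features(xml_text: str) -> List[Feature]:
    """Collect every FEATURE element found under each SEGMENT."""
    root = ET.fromstring(xml_text)
    features = []
    for segment in root.iter("SEGMENT"):
        for feat in segment.findall("FEATURE"):
            score_text = (feat.findtext("SCORE") or "").strip()
            link = feat.find("LINK")
            features.append(Feature(
                segment=segment.get("id", ""),
                label=feat.get("label", ""),
                start=int(feat.findtext("START")),
                end=int(feat.findtext("END")),
                score=float(score_text) if score_text not in ("", "-") else None,
                link=link.get("href") if link is not None else None,
            ))
    return features


for feature in parse_features(SAMPLE_REPLY):
    print(feature)
```

A client built this way needs no knowledge of the prediction method behind a feature; it only relies on the shared reply format.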
Table 1. Annotation methods provided by the CBS DAS system

| Method | Data source name | Organism coverage | Number of records | Reference |
|---|---|---|---|---|
| LipoP-1.0 | lipop | Gneg | 7 597 | (6) |
| NetAcet-1.0 | netacet | E | 122 664 | (7) |
| NetNES-1.1 | netnes | E | 1 945 054 | (11) |
| NetNGlyc-1.0 | netnglyc | H | 137 800 | |
| NetOGlyc-3.1 | netoglyc | M | 81 310 | (13) |
| NetPhos-2.0 | netphos | E | 8 940 654 | (12) |
| ProP-1.0 | prop | E | 127 553 | (14) |
| SecretomeP-1.0 | secretomep | E | 58 318 | (9) |
| SignalP-3.0 | signalp | E, Gpos, Gneg | 1 189 706 | (8) |
| TargetP-1.01 | targetp | E | 750 111 | (10) |
| TMHMM-2.0 | tmhmm | A | 5 086 476 | (15) |
| All the above combined | cbs_total | | 18 447 243 | |

The annotation methods are specific to the following phylogenetic groups: ‘A’ stands for all proteins, ‘E’ for eukaryotes, ‘Gpos’ for Gram-positive bacteria, ‘Gneg’ for Gram-negative bacteria, ‘H’ for human and ‘M’ for mammals. The data source name is the name of the particular annotation method on the DAS server.

In general, the annotations cover UniProt (4), but each method is limited to a phylogenetic subset of the database, because the annotation methods are usually constructed with a specific phylogenetic group as a target (see the reference for each server for details). Currently, the CBS DAS servers provide over 18 million protein annotations for over 1.5 million protein sequences from the UniProt database, and we hope that this wide coverage makes our services of general interest to the scientific community. The predicted annotations include several highly cited methods, e.g. SignalP and NetPhos, whose papers are among the top 1% of the most cited papers in the scientific literature according to the Institute for Scientific Information. The annotations are precalculated and the results stored in a relational database, allowing for fast retrieval and update of data.
Regarding the terminology of the predicted features, we have generally used the nomenclature of the original prediction method. In some cases, we have modified the feature names to mimic the UniProt feature table, thus reflecting the structure of the reference database and allowing easy comparison between the reference UniProt server and other annotation resources. The vocabulary may be updated at a later point to make use of standard ontologies such as the Gene Ontology (GO) (16), so that post-translational modifications would be mapped onto GO ‘biological process’, etc. The concept of the Sequence Ontology (SO) is highly relevant to this project; however, the SO does not yet provide sufficient coverage of protein sequence attributes, such as post-translational modification, to be useful for our purposes.
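The precalculated storage described above can be pictured with a small relational sketch. Everything in it is an assumption made for illustration: the table layout, column names, accessions and scores are hypothetical and do not describe the actual CBS database schema, and SQLite is used only because it ships with Python.

```python
import sqlite3

# In-memory database standing in for the annotation store (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE annotation (
        dsn          TEXT    NOT NULL,  -- data source name, e.g. 'signalp'
        accession    TEXT    NOT NULL,  -- sequence identifier
        feature_type TEXT    NOT NULL,
        start_pos    INTEGER NOT NULL,
        end_pos      INTEGER NOT NULL,
        score        REAL
    )
""")
# An index on (dsn, accession) keeps per-sequence lookups fast.
conn.execute("CREATE INDEX idx_annotation_lookup ON annotation (dsn, accession)")

# Made-up rows; real records would come from running the prediction methods.
rows = [
    ("signalp", "EXAMPLE_1", "signal peptide", 1, 22, 0.97),
    ("netphos", "EXAMPLE_1", "phosphorylation site", 45, 45, 0.81),
    ("signalp", "EXAMPLE_2", "signal peptide", 1, 19, 0.88),
]
conn.executemany("INSERT INTO annotation VALUES (?, ?, ?, ?, ?, ?)", rows)


def features_for(connection, dsn, accession):
    """Return the stored features of one method for one sequence, ordered by position."""
    cursor = connection.execute(
        "SELECT feature_type, start_pos, end_pos, score FROM annotation "
        "WHERE dsn = ? AND accession = ? ORDER BY start_pos",
        (dsn, accession),
    )
    return cursor.fetchall()


print(features_for(conn, "signalp", "EXAMPLE_1"))
```

Because the predictions are computed ahead of time, answering a query reduces to an indexed lookup like the one in `features_for`, and updating a method's results only touches the rows for its DSN.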

A query example

When querying a DAS server for annotation, one must append the DSN, along with a query type and a sequence identifier, to the address of the server. For example, to ask for annotations from the SignalP signal peptide prediction method (8) for the protein EGFR_HUMAN, we first append the DSN for that method (‘signalp’, Table 1). We then use the ‘features’ query to ask for feature annotations and identify the sequence as a ‘segment’. The whole query string thus looks like this: .
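The construction just described can be sketched in a few lines of Python. The server address used here (das.example.org) is a placeholder, not the CBS address, and the `<server>/<DSN>/<query type>?segment=<identifier>` layout reflects our reading of the DAS 1.x convention; `build_das_query` is a hypothetical helper.

```python
from urllib.parse import quote, urlencode

# Placeholder address; substitute the address of a real DAS server.
DAS_BASE = "http://das.example.org/das"


def build_das_query(base: str, dsn: str, command: str, segment: str) -> str:
    """Append the DSN and query type to the server address, then name the segment."""
    path = "/".join([base.rstrip("/"), quote(dsn), quote(command)])
    return f"{path}?{urlencode({'segment': segment})}"


url = build_das_query(DAS_BASE, "signalp", "features", "EGFR_HUMAN")
print(url)
```

Swapping ‘signalp’ for another DSN from Table 1, or ‘features’ for ‘sequence’, yields the corresponding query against the same server.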

CBS DAS VIEWER

As the raw XML output of DAS servers is not well suited to browsing feature annotations, we have developed a client viewer that visualizes CBS DAS annotations in a simple graphical way. This viewer is publicly available at http://www.cbs.dtu.dk/cgi-bin/das. The user only needs to enter a UniProt accession number or identifier. The viewer then collects the annotations provided by the CBS DAS servers for that sequence, along with annotations from the UniProt reference DAS server at the European Bioinformatics Institute (http://www.ebi.ac.uk/das-srv/uniprot/das). All the annotations are then displayed as aligned graphical tracks, allowing easy inspection of features along the length of the protein. Additional information about an annotation is shown in a pop-up window when the user points the mouse at an annotation track. This is the first time CBS has provided a composite graphical display of several of its protein prediction methods simultaneously, which users of CBS prediction services may find interesting. Some types of feature annotations carry a hyperlink in the XML payload. When the user clicks on the graphical track for such an annotation, the CBS DAS protein viewer follows the hyperlink in a new browser window. The graphical tracks can also be folded and expanded to give a simplified overview. A screenshot of the client in action can be seen in Figure 1. The client demonstrates how easily different data sources can be integrated using DAS. We plan to incorporate relevant DAS protein annotation resources into the graphical client as they appear. At the time of writing, only one external DAS source was incorporated in the view: a resource provided by the Sanger Institute in which RCSB Protein Data Bank (17) structures are aligned to UniProt entries.
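To illustrate how annotations from several DAS sources can be combined into aligned tracks, the following Python sketch groups features by source and draws each track as a row of characters scaled to the protein length. It is a toy model rather than the CBS viewer's implementation; the `Annotation` type, both helper functions and all sample features are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Annotation(NamedTuple):
    source: str  # DAS data source the feature came from
    label: str   # feature label
    start: int   # first residue (1-based)
    end: int     # last residue (inclusive)


def build_tracks(*source_results: List[Annotation]) -> Dict[str, List[Annotation]]:
    """Merge the results of several sources into one position-sorted track per source."""
    tracks = defaultdict(list)
    for result in source_results:
        for annotation in result:
            tracks[annotation.source].append(annotation)
    return {
        source: sorted(items, key=lambda a: (a.start, a.end))
        for source, items in tracks.items()
    }


def render_track(annotations: List[Annotation], protein_length: int, width: int = 60) -> str:
    """Draw one track as text: '#' where a feature lies, '.' elsewhere."""
    cells = ["."] * width
    for annotation in annotations:
        first = (annotation.start - 1) * width // protein_length
        last = (annotation.end - 1) * width // protein_length
        for i in range(first, last + 1):
            cells[i] = "#"
    return "".join(cells)


# Made-up features for a 100-residue protein, as if returned by two servers.
cbs_result = [Annotation("signalp", "signal peptide", 1, 10)]
reference_result = [
    Annotation("uniprot", "chain", 11, 100),
    Annotation("uniprot", "signal", 1, 10),
]

tracks = build_tracks(cbs_result, reference_result)
for source, items in tracks.items():
    print(f"{source:>8} {render_track(items, protein_length=100, width=50)}")
```

Since every track is scaled against the same protein length, features from different servers line up column by column, which is the property that makes a composite view readable.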
Figure 1. The CBS protein DAS viewer. The browser interface is very simple: it has only one form field, and the graphical tracks show the annotations for a given UniProt protein. Additional information for individual features is shown in a pop-up help window when the mouse is pointed at the feature.

  17 in total

1.  Sequence and structure-based prediction of eukaryotic protein phosphorylation sites.

Authors:  N Blom; S Gammeltoft; S Brunak
Journal:  J Mol Biol       Date:  1999-12-17       Impact factor: 5.469

2.  Prediction of lipoprotein signal peptides in Gram-negative bacteria.

Authors:  Agnieszka S Juncker; Hanni Willenbrock; Gunnar Von Heijne; Søren Brunak; Henrik Nielsen; Anders Krogh
Journal:  Protein Sci       Date:  2003-08       Impact factor: 6.725

3.  Improved prediction of signal peptides: SignalP 3.0.

Authors:  Jannick Dyrløv Bendtsen; Henrik Nielsen; Gunnar von Heijne; Søren Brunak
Journal:  J Mol Biol       Date:  2004-07-16       Impact factor: 5.469

4.  Analysis and prediction of leucine-rich nuclear export signals.

Authors:  Tanja la Cour; Lars Kiemer; Anne Mølgaard; Ramneek Gupta; Karen Skriver; Søren Brunak
Journal:  Protein Eng Des Sel       Date:  2004-08-16       Impact factor: 1.650

5.  Prediction of proprotein convertase cleavage sites.

Authors:  Peter Duckert; Søren Brunak; Nikolaj Blom
Journal:  Protein Eng Des Sel       Date:  2004-01       Impact factor: 1.650

6.  NetAcet: prediction of N-terminal acetylation sites.

Authors:  Lars Kiemer; Jannick Dyrløv Bendtsen; Nikolaj Blom
Journal:  Bioinformatics       Date:  2004-11-11       Impact factor: 6.937

7.  A hidden Markov model for predicting transmembrane helices in protein sequences.

Authors:  E L Sonnhammer; G von Heijne; A Krogh
Journal:  Proc Int Conf Intell Syst Mol Biol       Date:  1998

8.  Predicting subcellular localization of proteins based on their N-terminal amino acid sequence.

Authors:  O Emanuelsson; H Nielsen; S Brunak; G von Heijne
Journal:  J Mol Biol       Date:  2000-07-21       Impact factor: 5.469

9.  Feature-based prediction of non-classical and leaderless protein secretion.

Authors:  Jannick Dyrløv Bendtsen; Lars Juhl Jensen; Nikolaj Blom; Gunnar Von Heijne; Søren Brunak
Journal:  Protein Eng Des Sel       Date:  2004-04-28       Impact factor: 1.650

10.  The distributed annotation system.

Authors:  R D Dowell; R M Jokerst; A Day; S R Eddy; L Stein
Journal:  BMC Bioinformatics       Date:  2001-10-10       Impact factor: 3.169
