ProServer: a simple, extensible Perl DAS server.

Robert D Finn, James W Stalker, David K Jackson, Eugene Kulesha, Jody Clements, Roger Pettett.

Abstract

SUMMARY: The increasing size and complexity of biological databases have led to a growing trend to federate rather than duplicate them. To share data between federated databases, protocols for the exchange mechanism must be developed. One widely used data exchange protocol is the Distributed Annotation System (DAS). For example, DAS has enabled small experimental groups to integrate their data into the Ensembl genome browser. We have developed ProServer, a simple, lightweight, Perl-based DAS server that does not depend on a separate HTTP server. The ProServer package is easily extensible, allowing data to be served from almost any underlying data model. Recent additions to the DAS protocol have enabled both structure and alignment (sequence and structural) data to be exchanged. ProServer allows both of these data types to be served. AVAILABILITY: ProServer can be downloaded from http://www.sanger.ac.uk/proserver/ or CPAN http://search.cpan.org/~rpettett/. Details on the system requirements and installation of ProServer can be found at http://www.sanger.ac.uk/proserver/.

Year: 2007 PMID: 17237073 PMCID: PMC2989875 DOI: 10.1093/bioinformatics/btl650

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

High-throughput projects, such as the sequencing of the human genome, have resulted in a deluge of data. Thus, a key challenge in modern bioinformatics is to put in place mechanisms for programmatic exchange of and access to large volumes of data from disparate resources. As biological databases increase in size and complexity, the classical mechanism of data exchange, database duplication, can become impractical. For example, the underlying database for Ensembl release 41 is over 700GB, containing data on 25 different genomes, spread across hundreds of tables. An alternative to duplication of databases is to interlink the distributed resources, termed federation. However, for data exchange to take place between federated databases, a mechanism and standardization of format must be agreed. One such protocol for data exchange over a network is the Distributed Annotation System (DAS). Briefly, the DAS protocol standardizes queries and responses for DNA or protein sequences, along with their annotations, regardless of the underlying data architecture (Dowell et al., 2001). In common with other Web Services, client requests are made via HTTP to servers which process them and return results encapsulated in XML. Since the original DAS specification was developed, DAS has been used extensively for the annotation of genome (DNA) sequences and, more recently, of proteins. In addition, many extensions to the DAS protocol have been proposed. Two of these extensions, developed over the past year, allow the exchange of sequence alignments and protein structure data. These extensions enable annotations to be integrated across DNA, protein sequence and protein structure. We have developed ProServer, a simple, user-friendly package for making data available using the DAS protocol.
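The request and response cycle described above can be sketched in a few lines of Python. This is an illustration only, not part of ProServer: the host `das.example.org`, the source name `demo` and the sample response are invented for the example, and the XML shows only a small subset of the elements a real DAS features response would carry.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def features_url(base, source, segment, start, stop):
    """Build a DAS 'features' request URL for one segment range."""
    # Keep ':' and ',' unescaped so the segment reads as id:start,stop.
    query = urlencode({"segment": f"{segment}:{start},{stop}"}, safe=":,")
    return f"{base.rstrip('/')}/das/{source}/features?{query}"


# Hypothetical, heavily trimmed response used instead of a live server.
SAMPLE_RESPONSE = """<DASGFF>
  <GFF version="1.0" href="http://das.example.org/das/demo/features">
    <SEGMENT id="1" start="1000" stop="2000">
      <FEATURE id="gene_a" label="Gene A">
        <TYPE id="gene">gene</TYPE>
        <START>1200</START>
        <END>1800</END>
      </FEATURE>
    </SEGMENT>
  </GFF>
</DASGFF>"""


def parse_features(xml_text):
    """Extract a flat list of feature records from a features response."""
    root = ET.fromstring(xml_text)
    features = []
    for segment in root.iter("SEGMENT"):
        for feature in segment.iter("FEATURE"):
            features.append({
                "segment": segment.get("id"),
                "id": feature.get("id"),
                "type": feature.findtext("TYPE"),
                "start": int(feature.findtext("START")),
                "end": int(feature.findtext("END")),
            })
    return features
```

Here `features_url("http://das.example.org", "demo", "1", 1000, 2000)` gives `http://das.example.org/das/demo/features?segment=1:1000,2000`, and `parse_features(SAMPLE_RESPONSE)` yields a single record for `gene_a`. A real client would fetch the URL over HTTP and should consult the DAS specification for the full response schema.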

2 PROSERVER

ProServer has been available from the Wellcome Trust Sanger Institute since 2003. We have recently added functionality and documentation to improve usability and robustness.

2.1 Architecture and flexibility

ProServer is a standalone, lightweight DAS server, written in Perl and designed to have low system requirements. The architecture of ProServer is represented in Figure 1. At the top level, there is a daemon executable which acts as a broker between requests and the code that will handle them. The server is configured using a ‘.ini’-format configuration file holding settings for data sources provided by the server (Fig. 1). The networking requirements of ProServer are handled by the POE package (http://poe.perl.org/), which implements a portable multitasking and networking framework for Perl.
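To make the idea of an '.ini'-format configuration concrete, the sketch below parses a small example with Python's standard `configparser`. The section names (`general`, `mygenes`) and every key (`port`, `adaptor`, `transport`, `filename`) are assumptions chosen for illustration; the settings ProServer actually accepts are documented in its README and may differ.

```python
import configparser

# Hypothetical configuration: one server-wide section plus one section
# per data source. Key names are illustrative, not ProServer's own.
EXAMPLE_INI = """
[general]
port = 9000

[mygenes]
adaptor = gene_features
transport = file
filename = /data/genes.gff
"""


def load_sources(ini_text):
    """Split an ini document into server settings and per-source settings."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    server = dict(parser["general"])
    sources = {
        name: dict(parser[name])
        for name in parser.sections()
        if name != "general"
    }
    return server, sources
```

Calling `load_sources(EXAMPLE_INI)` returns the server settings and a dictionary keyed by source name, which mirrors the way a broker could look up the adaptor and transport for an incoming request.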

Fig. 1.

A schematic representation of the ProServer architecture. (See text for details.)

The server sits upon a number of source adaptors, each dedicated to a single underlying data store. Incoming queries are passed from the daemon to the appropriate source adaptor. The source adaptors handle access to the data store and the conversion of queries into generic DAS response data structures. Linking source adaptors to their data stores are generic transport helpers, which are responsible for handling data acquisition (Fig. 1). Depending on the format of any new data set, it may be necessary to implement a new transport helper (Fig. 1). However, ProServer comes bundled with helpers for some common data stores, for example: flatfile, GFF, MySQL, Oracle and SRS getz. These transport helpers are all simple command-line or socket-handling modules.

How are new data sources made available using ProServer? To expose a different source of data, a new source adaptor class must be written. All such source adaptors inherit generic functionality and extend it by implementing the data retrieval methods that are applicable to the new data set (e.g. features or sequence). The superclass transparently handles the transformation of data to XML (Fig. 1). Finally, the details of the new source adaptor are entered into the ProServer configuration file. Specific details regarding installation, security (and best practice usage) and scalability can be found in the ProServer README file.
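The division of labour between transport helper, source adaptor and superclass can be sketched as follows. ProServer itself is written in Perl; this Python version only illustrates the pattern, and all class and method names (`FlatFileTransport`, `SourceAdaptor`, `GeneAdaptor`, `build_features`, `features_xml`) are hypothetical rather than taken from the ProServer API.

```python
from abc import ABC, abstractmethod
import xml.etree.ElementTree as ET


class FlatFileTransport:
    """Hypothetical transport helper: yields tab-separated rows from text."""

    def __init__(self, text):
        self.text = text

    def rows(self):
        for line in self.text.splitlines():
            if line.strip():
                yield line.split("\t")


class SourceAdaptor(ABC):
    """Generic superclass: subclasses fetch data, this class emits XML."""

    def __init__(self, transport):
        self.transport = transport

    @abstractmethod
    def build_features(self, segment, start, end):
        """Return a list of dicts with keys 'id', 'start' and 'end'."""

    def features_xml(self, segment, start, end):
        # Simplified output: a real DAS response has more elements.
        seg = ET.Element("SEGMENT", id=segment, start=str(start), stop=str(end))
        for feat in self.build_features(segment, start, end):
            node = ET.SubElement(seg, "FEATURE", id=feat["id"])
            ET.SubElement(node, "START").text = str(feat["start"])
            ET.SubElement(node, "END").text = str(feat["end"])
        return ET.tostring(seg, encoding="unicode")


class GeneAdaptor(SourceAdaptor):
    """Example adaptor for rows of: segment, feature id, start, end."""

    def build_features(self, segment, start, end):
        hits = []
        for seg, fid, fstart, fend in self.transport.rows():
            fstart, fend = int(fstart), int(fend)
            # Keep features that overlap the requested range.
            if seg == segment and fstart <= end and fend >= start:
                hits.append({"id": fid, "start": fstart, "end": fend})
        return hits
```

A new data set then needs only a subclass with its own `build_features`, plus a transport if none of the existing ones fits; XML serialization stays in the superclass and is not repeated per source.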

2.2 Other DAS servers

Two alternative systems for creating DAS servers are Dazzle (http://www.derkholm.net/thomas/dazzle/) and the Lightweight Distributed Annotation Server (LDAS, http://biodas.org/servers/LDAS.html). Dazzle is written in Java, whereas LDAS is written in Perl. Dazzle has comparable functionality to ProServer, whereas LDAS does not have alignment or structure capabilities. Both of these alternatives require an available web server (e.g. Apache) and time-consuming configuration. Although ProServer does not use a web server such as Apache, our experience indicates that the main bottleneck is input–output rather than computation. Thus, provided that the underlying data store is organized optimally (as it would need to be with the alternative servers), similar performance and scalability should be achievable regardless of the DAS server used. The simplicity of ProServer means that it can be deployed by users with only basic bioinformatics skills. Therefore, groups can expose their data set, whether large or small, to the scientific community with little overhead beyond the storage of the data set itself.

2.3 Examples

ProServer was originally developed in conjunction with the Ensembl project (Birney et al., 2006), to display features such as gene predictions on chromosomes. Since the original release of ProServer, with only features and sequence capabilities, we have added both alignment and structure functionality. This has enabled the Pfam database (Finn et al., 2006) to provide access to its protein sequence alignments, using ProServer to serve data directly from the underlying MySQL database (see http://das.sanger.ac.uk/das/pfamAlign). Other projects outside of the Wellcome Trust Sanger Institute are already using ProServer for the exchange of feature annotations: examples include Gene3D (Yeats et al., 2006), CBS (Ólason, 2005) and AnoEST (Kriventseva et al., 2005). The supplementary materials contain a list of some of the most popular clients, which provides an insight into the use of DAS within the scientific community.

3 CONCLUSIONS

The federation of databases removes the effort involved with their duplication and maintenance and the problem of asynchronous versions. The relatively simple architecture of ProServer means that even small groups can make their data available via DAS, even if it is stored only as a flatfile, without the overheads of running a web server. Once available, this resource can be readily integrated into other data sets. For example, the Ensembl browser (Birney et al., 2006) allows new DAS sources to be added and displayed alongside existing annotation, even if they are only internally available at an institute. Rare chromosomal abnormality data from the DECIPHER project (http://decipher.sanger.ac.uk/syndromes) are being displayed in a genomic context via the Ensembl browser using ProServer, for example, the 1p36 microdeletion (http://www.sanger.ac.uk/turl/72d). A survey of the DAS registry (Prlic et al., 2006) (http://www.dasregistry.org/) demonstrated that 69 of the 103 registered feature servers were using ProServer at the time of writing. Thus, ProServer is already being widely used to provide data via the DAS protocol. The recent extensions increase the range of data types available via DAS, allowing the transfer of annotations between aligned objects. These extensions are leading to the development of exciting new clients that bring together different data types (Prlić et al., 2005). ProServer's minimal requirements and simplicity, together with the widespread use of DAS in bioinformatics, make ProServer a valuable tool for publishing data in a common, standard format. The accessibility and programmatic integration of distributed resources into state-of-the-art tools and websites are accelerating novel scientific discoveries.

4 AVAILABILITY

ProServer is available from http://www.sanger.ac.uk/proserver/ or CPAN http://search.cpan.org/~rpettett/ and has been tested on Tru64, Linux and Mac OS X architectures running Perl 5.6.1 and above. ProServer has also been run under Windows using the Cygwin environment.

REFERENCES

1. Adding some SPICE to DAS.

Authors: Andreas Prlić; Thomas A Down; Tim J P Hubbard
Journal: Bioinformatics Date: 2005-09-01 Impact factor: 6.937

2. AnoEST: toward A. gambiae functional genomics.

Authors: Evgenia V Kriventseva; Anastasios C Koutsos; Claudia Blass; Fotis C Kafatos; George K Christophides; Evgeny M Zdobnov
Journal: Genome Res Date: 2005-05-17 Impact factor: 9.043

3. Ensembl 2006.

Authors: E Birney; D Andrews; M Caccamo; Y Chen; L Clarke; G Coates; T Cox; F Cunningham; V Curwen; T Cutts; T Down; R Durbin; X M Fernandez-Suarez; P Flicek; S Gräf; M Hammond; J Herrero; K Howe; V Iyer; K Jekosch; A Kähäri; A Kasprzyk; D Keefe; F Kokocinski; E Kulesha; D London; I Longden; C Melsopp; P Meidl; B Overduin; A Parker; G Proctor; A Prlic; M Rae; D Rios; S Redmond; M Schuster; I Sealy; S Searle; J Severin; G Slater; D Smedley; J Smith; A Stabenau; J Stalker; S Trevanion; A Ureta-Vidal; J Vogel; S White; C Woodwark; T J P Hubbard
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

4. Gene3D: modelling protein structure, function and evolution.

Authors: Corin Yeats; Michael Maibaum; Russell Marsden; Mark Dibley; David Lee; Sarah Addou; Christine A Orengo
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

5. Pfam: clans, web tools and services.

Authors: Robert D Finn; Jaina Mistry; Benjamin Schuster-Böckler; Sam Griffiths-Jones; Volker Hollich; Timo Lassmann; Simon Moxon; Mhairi Marshall; Ajay Khanna; Richard Durbin; Sean R Eddy; Erik L L Sonnhammer; Alex Bateman
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

6. Integrating protein annotation resources through the Distributed Annotation System.

Authors: Páll Isólfur Olason
Journal: Nucleic Acids Res Date: 2005-07-01 Impact factor: 16.971

7. The distributed annotation system.

Authors: R D Dowell; R M Jokerst; A Day; S R Eddy; L Stein
Journal: BMC Bioinformatics Date: 2001-10-10 Impact factor: 3.169
