
BioMart Central Portal--unified access to biological data.

Syed Haider, Benoit Ballester, Damian Smedley, Junjun Zhang, Peter Rice, Arek Kasprzyk.

Abstract

BioMart Central Portal (www.biomart.org) offers a one-stop shop solution to access a wide array of biological databases. These include major biomolecular sequence, pathway and annotation databases such as Ensembl, Uniprot, Reactome, HGNC, Wormbase and PRIDE; for a complete list, visit http://www.biomart.org/biomart/martview. Moreover, the web server features seamless data federation, making it possible to query across these data sources in a user-friendly and unified way. The web server not only provides access through a web interface (MartView) but also supports programmatic access through a Perl API as well as RESTful and SOAP-oriented web services. The website is free and open to all users and there is no login requirement.

Year:  2009        PMID: 19420058      PMCID: PMC2703988          DOI: 10.1093/nar/gkp265

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

Advances in sequencing technologies and the subsequent growth in the repertoire of biological information are posing serious data-management challenges, and the volume of these data is expected to continue to grow exponentially. Projects such as GenBank (1), HapMap (2) and the SNP Consortium are prime examples of the high-throughput data-management challenges that we are experiencing. Querying different biological data sources in an integrated manner generally involves moving all the data into a centralized data warehouse, which requires substantial resources to keep it up to date with the component data sources. New-generation sequencing projects such as the 1000 Genomes Project and the International Cancer Genome Consortium (ICGC) are expected to produce data on an unprecedented scale. Moving this type of data into a central location for integrated querying with other resources presents considerable organizational and physical transfer challenges. One solution lies in federated databases, whereby individual data providers are responsible for their own updates and release cycles. The federated model eliminates the need to aggregate and manage all the data in any one central location. Another dimension of this problem is the provision of fast and robust access to such large quantities of data: how do we bring these data to end-users without exposing back-end issues such as discovering the repository location, retrieving the information and merging it with other datasets to support cross querying, which biological queries often require? Lastly, the results returned from these databases must be in standard formats and, where possible, semantically annotated to ensure interoperability with other databases and tools. The Distributed Annotation System (DAS) (3) and BioMart (4) are functional examples of such frameworks.
The BioMart software system offers a generic framework for biological data storage and retrieval through a single point of access, and is particularly suited to large-scale 'omics data. The web server, BioMart Central Portal, provides access to a variety of datasets that can be queried independently or in a federated way, enabling users to ask complex questions over data sources that may be located at different geographical locations. These include Ensembl genomic, Uniprot protein, Reactome pathway, HGNC gene name, Wormbase genomic and PRIDE proteomic data (5–10). As of March 2009, BioMart Central Portal brings together an extensive range of databases (see Figure 1), serving more than 100 datasets with an average monthly usage of over 1 million server hits (see Supplementary Table S1). Furthermore, the web server provides complete access to metadata, which third-party client writers can use to emulate the functionality offered by BioMart Central Portal according to their domain requirements. We believe that this service will be of enormous benefit to many users and deployers, ranging from wet-lab biologists to computer scientists working in bioinformatics setups.
Figure 1.

List of databases available through BioMart Central Portal (March 2009).


BIOMART CENTRAL PORTAL

The BioMart Central Portal is a web server interface to the BioMart software. It provides a unified view over disparate data sources, enabling bioscientists to retrieve data from one or multiple sources in a simple and efficient way. The library behind the web server handles user requests and takes responsibility for fetching data from the respective locations, aggregating the results and formatting them in the specified format. Figure 2 describes the high-level system architecture and the data flow. A query to the BioMart Central Portal consists primarily of three simple abstractions: Dataset, Filters and Attributes. The Dataset is the logical boundary of the query, Filters (optional) are the inputs and Attributes are the user-specified outputs. The BioMart Central Portal handles queries from several interfaces, all of which use these three abstractions in a coherent way. These interfaces are:
Figure 2.

The schematic representation of BioMart Central Portal.

Perl API
Web interface (MartView)
URL based access
RESTful web service (MartService)
SOAP web service (MartServiceSoap)
DAS server

All the query interfaces are written in Perl. A detailed description of usage and query formulation is given in (11) and in the project documentation available at www.biomart.org/install.html. In the sections that follow, we describe access to BioMart Central Portal through its web service end-point, MartServiceSoap. BioMart queries fall into two fundamental categories: metadata access and data access. A machine-readable, XML-based description of the inputs and outputs of these queries is published in Web Services Description Language (WSDL) and XML Schema Definition (XSD) files available at http://www.biomart.org/biomart/martwsdl and http://www.biomart.org/biomart/martxsd.
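As an illustration of the three abstractions, the following minimal Python sketch models a query as a dataset, optional filters and attributes, and serialises it to XML. The element and attribute names (`Query`, `Dataset`, `Filter`, `Attribute`, `formatter`, `header`, `uniqueRows`) and the dataset, filter and attribute names in the usage lines are assumptions made for illustration; the authoritative request structure is the one published in the WSDL and XSD files.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class MartQuery:
    """One query: a dataset, optional filters (inputs), attributes (outputs)."""
    dataset: str
    attributes: list[str]
    filters: dict[str, str] = field(default_factory=dict)

    def to_xml(self) -> str:
        if not self.attributes:
            raise ValueError("a query needs at least one attribute")
        # Element and attribute names are assumptions, not taken from the text.
        root = ET.Element("Query", formatter="TSV", header="0", uniqueRows="1")
        ds = ET.SubElement(root, "Dataset", name=self.dataset)
        for name, value in self.filters.items():
            ET.SubElement(ds, "Filter", name=name, value=value)
        for name in self.attributes:
            ET.SubElement(ds, "Attribute", name=name)
        return ET.tostring(root, encoding="unicode")


# Hypothetical dataset, filter and attribute names.
query = MartQuery(
    dataset="hsapiens_gene_ensembl",
    attributes=["ensembl_gene_id", "hgnc_symbol"],
    filters={"chromosome_name": "21"},
)
query_xml = query.to_xml()
print(query_xml)
```

Because filters are optional, `MartQuery("some_dataset", ["some_attribute"])` is also a valid query in this sketch, whereas a query without attributes is rejected.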

Metadata Access

These requests retrieve information about which databases, datasets, filters, attributes and associated formatters are made available by BioMart Central Portal. Besides supporting programmatic access, they return additional information that may be used to write specialized, domain-specific clients that access BioMart Central Portal remotely. The requests are as follows:

getRegistry

This request retrieves information such as the name, location, host and port of every database/mart available at BioMart Central Portal. The output is equivalent to the list displayed by MartView (see Figure 1).

getDatasets

This request retrieves the list of datasets available under a mart; the mart name is the input of the request.

getFilters and getAttributes

These two requests retrieve the lists of all filters and attributes available for a given dataset. Additional information about hierarchy, limitations and output formatters is also returned. Most importantly, the W3C-suggested property ‘modelReference’ in the output, if configured by the data publisher, provides the Uniform Resource Identifier (URI) of the ontology concept that describes the output attribute(s). This feature offers a framework for semantic annotation of terms in BioMart databases and will improve the interoperability of BioMart results with non-BioMart data sources and analysis tools.
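The four metadata requests form a simple hierarchy: the registry needs no input, datasets need a mart name, and filters and attributes need a dataset name. The sketch below expresses this as a small URL builder for a REST-style counterpart of the requests. The service path and the query parameter names (`type`, `mart`, `dataset`) are assumptions made for illustration and are not taken from the text; the SOAP operations themselves are defined in the published WSDL.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed service path; the text only names the portal host.
BASE_URL = "http://www.biomart.org/biomart/martservice"

# Inputs each metadata request needs, following the descriptions above.
REQUIRED = {
    "registry": (),
    "datasets": ("mart",),
    "filters": ("dataset",),
    "attributes": ("dataset",),
}


def metadata_url(kind: str, **params: str) -> str:
    """Build the URL for one metadata request (parameter names are assumed)."""
    if kind not in REQUIRED:
        raise ValueError(f"unknown metadata request: {kind}")
    missing = [p for p in REQUIRED[kind] if p not in params]
    if missing:
        raise ValueError(f"{kind} request needs: {', '.join(missing)}")
    return f"{BASE_URL}?{urlencode({'type': kind, **params})}"


print(metadata_url("registry"))
print(metadata_url("datasets", mart="ensembl"))  # hypothetical mart name
print(metadata_url("attributes", dataset="hsapiens_gene_ensembl"))  # hypothetical dataset
```

A client would typically walk this hierarchy top-down: list the marts, pick one, list its datasets, then fetch the filters and attributes of the chosen dataset before composing a data query.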

Data Access

A query request is used to access the biological content of the marts available through the BioMart web server. Figure 3a illustrates an example query in MartServiceSoap format that spans two datasets (Ensembl Homo sapiens and Reactome pathways) residing at different locations (Sanger and CSHL). The query finds the alleles in genes involved in the regulation of DNA replication. A user specifies the attributes of interest, along with any limitations (filters), from one or more datasets and in return gets results as shown in Figure 3b. Users are expected to know neither the database-specific access protocol nor the physical location of the data. From a user's point of view, all datasets appear to reside at BioMart Central Portal, which takes care of all the underlying federation logic.
Figure 3.

(a) SOAP request envelope representing data federation between Ensembl Homo sapiens (Sanger-UK) and Reactome pathway (CSHL-US) datasets. The query finds the alleles in genes involved in the regulation of DNA replication. (b) SOAP response envelope for the query shown in Figure 3a.
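To make the point about location transparency concrete, the sketch below serialises a single query that spans two datasets. It is not the SOAP envelope of Figure 3a: the XML layout and every dataset, filter and attribute name are hypothetical. What it shows is that the request names datasets only, with no host, port or access protocol.

```python
import xml.etree.ElementTree as ET


def linked_query(parts: list[dict]) -> str:
    """Serialise one query spanning several datasets (assumed XML layout).

    Each part is {"dataset": str, "filters": dict, "attributes": list}.
    """
    root = ET.Element("Query", formatter="TSV", header="0", uniqueRows="1")
    for part in parts:
        ds = ET.SubElement(root, "Dataset", name=part["dataset"])
        for name, value in part.get("filters", {}).items():
            ET.SubElement(ds, "Filter", name=name, value=value)
        for name in part.get("attributes", []):
            ET.SubElement(ds, "Attribute", name=name)
    return ET.tostring(root, encoding="unicode")


# Hypothetical names standing in for the Ensembl and Reactome datasets.
federated_xml = linked_query([
    {"dataset": "ensembl_human_genes", "attributes": ["gene_id", "allele"]},
    {"dataset": "reactome_pathways",
     "filters": {"pathway_name": "Regulation of DNA replication"}},
])
print(federated_xml)
```

Resolving each dataset name to a server and an access route is left entirely to the portal.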


Query processing

The BioMart server-side software consists of a QueryPlanner and an Aggregator. The QueryPlanner consumes data access queries and formulates an execution plan. If BioMart Central Portal has direct access credentials to the database server, SQL statements are compiled; otherwise, XML-based web service requests are sent to the remote BioMart web server over HTTP and the results are retrieved over the same connection. The execution plan therefore consists of ANSI SQL statements (to ensure compatibility across MySQL, Oracle and PostgreSQL), web service requests, or a combination of both if a query involves some datasets that provide direct database access and others that provide only web service access. To minimize database or HTTP time-outs and slow response times, the query engine uses a batching system that performs the job over several iterations. The results are piped back to the user as soon as the first batch is finished. The Aggregator component merges data coming from different sources on a common concept. This is achieved by extending the aforementioned abstractions, Attributes and Filters, to Exportables and Importables. A dataset that exposes an attribute as an exportable is able to integrate data from all sources in which a filter with a similar name is tagged as an importable. Exportables and importables are database table columns with similar contents. The aggregation of results is an in-memory operation whose cost is kept low by the batching model described above.
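The interplay of batching and aggregation can be sketched in a few lines of Python. This is an illustrative model only, not the BioMart implementation (which is written in Perl): the function names, the batch size and the sample rows are all hypothetical. Each batch of rows from the first dataset supplies its exportable values as importable filter values for the second source, the two sides are merged in memory, and merged rows are yielded before the next batch is requested.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator

Row = dict[str, str]


def batches(rows: Iterable[Row], size: int) -> Iterator[list[Row]]:
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk


def federated_join(
    primary: Iterable[Row],
    fetch_secondary: Callable[[list[str]], Iterable[Row]],
    exportable: str,
    importable: str,
    batch_size: int = 2,
) -> Iterator[Row]:
    """Join primary rows to secondary rows, one batch at a time."""
    for batch in batches(primary, batch_size):
        # Exportable values of this batch become the importable filter values.
        keys = sorted({row[exportable] for row in batch})
        by_key: dict[str, list[Row]] = {}
        for sec in fetch_secondary(keys):
            by_key.setdefault(sec[importable], []).append(sec)
        # In-memory merge, limited to the rows of the current batch.
        for row in batch:
            for sec in by_key.get(row[exportable], []):
                yield {**row, **sec}


# Hypothetical sample data standing in for two remote sources.
genes = [
    {"gene_id": "G1", "allele": "A/G"},
    {"gene_id": "G2", "allele": "C/T"},
    {"gene_id": "G3", "allele": "G/T"},
]
pathways = [
    {"linked_gene": "G1", "pathway": "P1"},
    {"linked_gene": "G3", "pathway": "P1"},
]
calls: list[list[str]] = []


def fetch_pathways(gene_ids: list[str]) -> list[Row]:
    """Stand-in for a SQL statement or web service request to the second source."""
    calls.append(list(gene_ids))
    return [p for p in pathways if p["linked_gene"] in gene_ids]


merged = list(federated_join(genes, fetch_pathways, "gene_id", "linked_gene"))
for row in merged:
    print(row)
```

Because `federated_join` is a generator, a caller that iterates over it lazily receives the rows of the first batch before the second batch is fetched, and memory use is bounded by the batch size rather than by the size of the full result.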

Registry

The BioMart Central Portal does not store any data locally, except for meta-information about all the datasets. The server maintains a registry containing references to remote BioMart web servers. To add a new mart to this registry, we require only the URL of the BioMart server hosting the databases, or read access to the database server. This information is added to the registry file of the web server and, following a configuration rerun, the whole bioinformatics community can benefit from the data through BioMart Central Portal as well as through several third-party software packages (see www.biomart.org for a complete list). The web server stays in sync with any data updates carried out on the various databases. Metadata updates, however, become available only after the web server has been reconfigured, shortly after the stable release of such updates.
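A registry of this kind can be pictured as a small XML file with one entry per mart, where the entry type records whether the portal reaches the mart through a remote BioMart web server or through direct database access. The sketch below parses such a file and derives the access route for each mart. The registry content, the element names (`MartRegistry`, `MartURLLocation`, `MartDBLocation`) and their attributes are assumptions made for illustration, and the hosts are placeholders.

```python
import xml.etree.ElementTree as ET

# Hypothetical registry content; element and attribute names are assumptions.
REGISTRY_XML = """
<MartRegistry>
  <MartURLLocation name="mart_a" host="marts.example.org" port="80"
                   path="/biomart/martservice" />
  <MartDBLocation name="mart_b" host="db.example.org" port="3306"
                  databaseType="mysql" />
</MartRegistry>
"""


def access_plan(registry_xml: str) -> dict[str, str]:
    """Map each registered mart to the route used to reach it."""
    routes = {"MartDBLocation": "sql", "MartURLLocation": "web service"}
    plan = {}
    for entry in ET.fromstring(registry_xml):
        if entry.tag in routes:
            plan[entry.get("name")] = routes[entry.tag]
    return plan


print(access_plan(REGISTRY_XML))
```

Under these assumptions, adding a mart amounts to appending one entry to the file and reloading the configuration.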

FUTURE DIRECTIONS

We are working on extending the system to support multiple and more specialized web GUIs. This includes integration of analysis and visualization plugins with special focus on cancer research. We also envisage substantial development towards semantic annotation of attributes and filters by data publishers that would enhance the interoperability of mart datasets with analysis tools and non-BioMart databases. MartServiceSoap provides a complete framework to define ontology references for the annotation of these terms and we would like to collaborate with data providers to achieve this goal.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

Ontario Institute for Cancer Research; the Wellcome Trust, EMBL; the European Commission within its FP6 Programme under the thematic area ‘Life sciences, genomics and biotechnology for health’, contract number LHSG-CT-2004-512092. Funding for open access charge: Ontario Government and Ministry of Research and Innovation.

Conflict of interest statement. None declared.
  11 in total

1.  EnsMart: a generic system for fast and flexible access to biological data.

Authors:  Arek Kasprzyk; Damian Keefe; Damian Smedley; Darin London; William Spooner; Craig Melsopp; Martin Hammond; Philippe Rocca-Serra; Tony Cox; Ewan Birney
Journal:  Genome Res       Date:  2004-01       Impact factor: 9.043

2.  A second generation human haplotype map of over 3.1 million SNPs.

Authors:  Kelly A Frazer; Dennis G Ballinger; David R Cox; David A Hinds; Laura L Stuve; Richard A Gibbs; John W Belmont; Andrew Boudreau; Paul Hardenbol; Suzanne M Leal; Shiran Pasternak; David A Wheeler; Thomas D Willis; Fuli Yu; Huanming Yang; Changqing Zeng; Yang Gao; Haoran Hu; Weitao Hu; Chaohua Li; Wei Lin; Siqi Liu; Hao Pan; Xiaoli Tang; Jian Wang; Wei Wang; Jun Yu; Bo Zhang; Qingrun Zhang; Hongbin Zhao; Hui Zhao; Jun Zhou; Stacey B Gabriel; Rachel Barry; Brendan Blumenstiel; Amy Camargo; Matthew Defelice; Maura Faggart; Mary Goyette; Supriya Gupta; Jamie Moore; Huy Nguyen; Robert C Onofrio; Melissa Parkin; Jessica Roy; Erich Stahl; Ellen Winchester; Liuda Ziaugra; David Altshuler; Yan Shen; Zhijian Yao; Wei Huang; Xun Chu; Yungang He; Li Jin; Yangfan Liu; Yayun Shen; Weiwei Sun; Haifeng Wang; Yi Wang; Ying Wang; Xiaoyan Xiong; Liang Xu; Mary M Y Waye; Stephen K W Tsui; Hong Xue; J Tze-Fei Wong; Luana M Galver; Jian-Bing Fan; Kevin Gunderson; Sarah S Murray; Arnold R Oliphant; Mark S Chee; Alexandre Montpetit; Fanny Chagnon; Vincent Ferretti; Martin Leboeuf; Jean-François Olivier; Michael S Phillips; Stéphanie Roumy; Clémentine Sallée; Andrei Verner; Thomas J Hudson; Pui-Yan Kwok; Dongmei Cai; Daniel C Koboldt; Raymond D Miller; Ludmila Pawlikowska; Patricia Taillon-Miller; Ming Xiao; Lap-Chee Tsui; William Mak; You Qiang Song; Paul K H Tam; Yusuke Nakamura; Takahisa Kawaguchi; Takuya Kitamoto; Takashi Morizono; Atsushi Nagashima; Yozo Ohnishi; Akihiro Sekine; Toshihiro Tanaka; Tatsuhiko Tsunoda; Panos Deloukas; Christine P Bird; Marcos Delgado; Emmanouil T Dermitzakis; Rhian Gwilliam; Sarah Hunt; Jonathan Morrison; Don Powell; Barbara E Stranger; Pamela Whittaker; David R Bentley; Mark J Daly; Paul I W de Bakker; Jeff Barrett; Yves R Chretien; Julian Maller; Steve McCarroll; Nick Patterson; Itsik Pe'er; Alkes Price; Shaun Purcell; Daniel J Richter; Pardis Sabeti; Richa Saxena; Stephen F Schaffner; Pak C Sham; Patrick Varilly; David Altshuler; Lincoln D Stein; Lalitha Krishnan; Albert Vernon Smith; Marcela K Tello-Ruiz; Gudmundur A Thorisson; Aravinda Chakravarti; Peter E Chen; David J Cutler; Carl S Kashuk; Shin Lin; Gonçalo R Abecasis; Weihua Guan; Yun Li; Heather M Munro; Zhaohui Steve Qin; Daryl J Thomas; Gilean McVean; Adam Auton; Leonardo Bottolo; Niall Cardin; Susana Eyheramendy; Colin Freeman; Jonathan Marchini; Simon Myers; Chris Spencer; Matthew Stephens; Peter Donnelly; Lon R Cardon; Geraldine Clarke; David M Evans; Andrew P Morris; Bruce S Weir; Tatsuhiko Tsunoda; James C Mullikin; Stephen T Sherry; Michael Feolo; Andrew Skol; Houcan Zhang; Changqing Zeng; Hui Zhao; Ichiro Matsuda; Yoshimitsu Fukushima; Darryl R Macer; Eiko Suda; Charles N Rotimi; Clement A Adebamowo; Ike Ajayi; Toyin Aniagwu; Patricia A Marshall; Chibuzor Nkwodimmah; Charmaine D M Royal; Mark F Leppert; Missy Dixon; Andy Peiffer; Renzong Qiu; Alastair Kent; Kazuto Kato; Norio Niikawa; Isaac F Adewole; Bartha M Knoppers; Morris W Foster; Ellen Wright Clayton; Jessica Watkin; Richard A Gibbs; John W Belmont; Donna Muzny; Lynne Nazareth; Erica Sodergren; George M Weinstock; David A Wheeler; Imtaz Yakub; Stacey B Gabriel; Robert C Onofrio; Daniel J Richter; Liuda Ziaugra; Bruce W Birren; Mark J Daly; David Altshuler; Richard K Wilson; Lucinda L Fulton; Jane Rogers; John Burton; Nigel P Carter; Christopher M Clee; Mark Griffiths; Matthew C Jones; Kirsten McLay; Robert W Plumb; Mark T Ross; Sarah K Sims; David L Willey; Zhu Chen; Hua Han; Le Kang; Martin Godbout; John C Wallenburg; Paul L'Archevêque; Guy Bellemare; Koji Saeki; Hongguang Wang; Daochang An; Hongbo Fu; Qing Li; Zhen Wang; Renwu Wang; Arthur L Holden; Lisa D Brooks; Jean E McEwen; Mark S Guyer; Vivian Ota Wang; Jane L Peterson; Michael Shi; Jack Spiegel; Lawrence M Sung; Lynn F Zacharia; Francis S Collins; Karen Kennedy; Ruth Jamieson; John Stewart
Journal:  Nature       Date:  2007-10-18       Impact factor: 49.962

3.  WormBase: new content and better access.

Authors:  Tamberlyn Bieri; Darin Blasiar; Philip Ozersky; Igor Antoshechkin; Carol Bastiani; Payan Canaran; Juancarlos Chan; Nansheng Chen; Wen J Chen; Paul Davis; Tristan J Fiedler; Lisa Girard; Michael Han; Todd W Harris; Ranjana Kishore; Raymond Lee; Sheldon McKay; Hans-Michael Müller; Cecilia Nakamura; Andrei Petcherski; Arun Rangarajan; Anthony Rogers; Gary Schindelman; Erich M Schwarz; Will Spooner; Mary Ann Tuli; Kimberly Van Auken; Daniel Wang; Xiaodong Wang; Gary Williams; Richard Durbin; Lincoln D Stein; Paul W Sternberg; John Spieth
Journal:  Nucleic Acids Res       Date:  2006-11-11       Impact factor: 16.971

4.  BioMart--biological queries made easy.

Authors:  Damian Smedley; Syed Haider; Benoit Ballester; Richard Holland; Darin London; Gudmundur Thorisson; Arek Kasprzyk
Journal:  BMC Genomics       Date:  2009-01-14       Impact factor: 3.969

5.  Ensembl 2009.

Authors:  T J P Hubbard; B L Aken; S Ayling; B Ballester; K Beal; E Bragin; S Brent; Y Chen; P Clapham; L Clarke; G Coates; S Fairley; S Fitzgerald; J Fernandez-Banet; L Gordon; S Graf; S Haider; M Hammond; R Holland; K Howe; A Jenkinson; N Johnson; A Kahari; D Keefe; S Keenan; R Kinsella; F Kokocinski; E Kulesha; D Lawson; I Longden; K Megy; P Meidl; B Overduin; A Parker; B Pritchard; D Rios; M Schuster; G Slater; D Smedley; W Spooner; G Spudich; S Trevanion; A Vilella; J Vogel; S White; S Wilder; A Zadissa; E Birney; F Cunningham; V Curwen; R Durbin; X M Fernandez-Suarez; J Herrero; A Kasprzyk; G Proctor; J Smith; S Searle; P Flicek
Journal:  Nucleic Acids Res       Date:  2008-11-25       Impact factor: 16.971

6.  GenBank.

Authors:  Dennis A Benson; Ilene Karsch-Mizrachi; David J Lipman; James Ostell; Eric W Sayers
Journal:  Nucleic Acids Res       Date:  2008-10-21       Impact factor: 16.971

7.  Reactome: a knowledge base of biologic pathways and processes.

Authors:  Imre Vastrik; Peter D'Eustachio; Esther Schmidt; Geeta Joshi-Tope; Gopal Gopinath; David Croft; Bernard de Bono; Marc Gillespie; Bijay Jassal; Suzanna Lewis; Lisa Matthews; Guanming Wu; Ewan Birney; Lincoln Stein
Journal:  Genome Biol       Date:  2007       Impact factor: 13.583

8.  The universal protein resource (UniProt).

Authors: 
Journal:  Nucleic Acids Res       Date:  2007-11-27       Impact factor: 16.971

9.  The HGNC Database in 2008: a resource for the human genome.

Authors:  Elspeth A Bruford; Michael J Lush; Mathew W Wright; Tam P Sneddon; Sue Povey; Ewan Birney
Journal:  Nucleic Acids Res       Date:  2007-11-04       Impact factor: 16.971

10.  PRIDE: new developments and new datasets.

Authors:  Philip Jones; Richard G Côté; Sang Yun Cho; Sebastian Klie; Lennart Martens; Antony F Quinn; David Thorneycroft; Henning Hermjakob
Journal:  Nucleic Acids Res       Date:  2007-11-22       Impact factor: 16.971

