Literature DB >> 21278190

A Ruby API to query the Ensembl database for genomic features.

Francesco Strozzi1, Jan Aerts.   

Abstract

UNLABELLED: The Ensembl database makes genomic features available via its Genome Browser. It is also possible to access the underlying data through a Perl API for advanced querying. We have developed a full-featured Ruby API to the Ensembl databases, providing the same functionality as the Perl interface with additional features. A single Ruby API is used to access different releases of the Ensembl databases and is also able to query multi-species databases.
AVAILABILITY AND IMPLEMENTATION: Most functionality of the API is provided using the ActiveRecord pattern. The library depends on introspection to make it release independent. The API is available through the Rubygem system and can be installed with the command gem install ruby-ensembl-api.

Entities:  

Mesh:

Year:  2011        PMID: 21278190      PMCID: PMC3065687          DOI: 10.1093/bioinformatics/btr050

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 INTRODUCTION

The Ensembl (Flicek ) and UCSC (Fujita ) genome browsers are the first point of call for a large community of genetics and genomics researchers. Both provide a graphical interface for browsing the genomes of a large number of species, displaying the location of genes, polymorphisms, repeats and regulatory regions. Each database can also be accessed directly via SQL and provides an interface for simple querying of the data: BioMart for Ensembl (Haider ) and the Table Browser for UCSC. In addition, the Ensembl team provides a Perl API for advanced scripted access to the data (Flicek ). In recent years, the Python and Ruby scripting languages have gained significant ground in the bioinformatics community (Aerts and Law, 2009; Cock ; Goto , see e.g.), increasing the need for a programmable interface in these languages. In this article, we describe a second API to the Ensembl database, focusing on the Ruby programming community.

2 IMPLEMENTATION

The data available in the Ensembl Genome Browser is stored in a set of MySQL relational databases and to a certain extent normalized. Every table covers one specific conceptual class of objects, such as ‘genes’ or ‘transcripts’. The Ruby ActiveRecord library is used to map tuples within the Core and Variation tables to objects of a class, and delivers a full API with a limited amount of code. As does the perl API, the Ruby library provides a Slice class which describes a region of the genome. Each class that defines a locus (e.g. gene, simple_feature, assembly_exception) includes the Sliceable mixin which provides additional methods such as returning the sequence of the object or its position. The Ruby and Perl Slice classes differ however significantly at the conceptual level. The slice for a feature (e.g. a gene) in the perl API is the complete seq_region this feature is located on, commonly the whole chromosome: e.g. ‘chromosome 13’ for the gene BRCA2. In contrast, the slice in the Ruby version is delimited by the boundaries of the feature: ‘chromosome:GRCh37:13:32889611:32973347:1’ for the same gene. As a result, methods such as overlaps?, contains? and within? are available for each object that implements Sliceable. The Ruby API to Ensembl is not part of the BioRuby project, but might be linked to it as a plugin in the future.

3 FEATURES

The user provides the species name (e.g. ‘Homo sapiens’) and an optional release number to connect to the Ensembl database. It is not necessary to make the distinction between Core or Variation; the code will internally open connections to either if necessary. In addition and in contrast to the Perl API, only a single Ruby interface is necessary for every Ensembl release. The Ruby API is also able to work with Ensembl Genomes databases, where multiple species are stored within the same database (e.g. bacterial, fungal and plant genomes; Kersey ). The Ruby Ensembl API provides—to our knowledge—the same functionality as the Perl API where concerning the Core and Variation databases. Class methods cover searching for records: every column in the table is available for querying by preceding it with find_by_. These class methods bypass the need for Adaptor objects as in the Perl API. Instance methods, on the other hand, work on records and for example provide access to specific data for a single column in a given row (e.g. my_gene.start). As each method call returns an object, different method calls can be chained together. To obtain the start position of the first transcript for a gene my_gene, the user can invoke my_gene.transcripts[0].start. Functionality that is not automatically covered by the ActiveRecord pattern but that is present in the Perl API is also provided. This includes but is not limited to converting genomic positions between different coordinate systems (e.g. between chromosome, scaffold and contig) and SNP effect prediction. In addition, the user can ask the API what types of object are related to each other, thus providing additional real-time documentation for every class in the API. The library provides two binaries. The ensembl command takes a species and release number as argument and drops the user in an interactive Ruby session for quick querying of the Ensembl database without the need to write one-off scripts. In addition, the variation_effect_predictor script takes a file containing known or novel SNPs/indels and annotates this list with a consequence type (e.g. essential_splice_site, stop_gained). Future efforts will focus on extending the API to the Compara and FunctionalGenomics databases which provide data for multi-species comparisons and for functional as well as regulatory information.

3.1 Example

Figure 1 shows an example of using the Ruby Ensembl API. Lines 1 to 3 load the library and connect to the Core and Variation databases for human release 60. In lines 4 and 5, the BRCA2 gene is retrieved and gene name and location are printed. The #slice method creates a Slice object, which describes the locus itself and is serialized into a string. Lines 6 and 7 retrieve all variations within the gene and report on the total number of variations and the dbSNP accession of the first one. In lines 8 and 9, a locus is defined and the code checks if the BRCA2 gene is within that locus. Finally, in lines 10 to 16, a slice of the genome is selected directly. For every gene in this slice, the gene name in printed as well as the stable ID and number of exons for each transcript.
Fig. 1.

Example script using the Ruby Ensembl API.

Example script using the Ruby Ensembl API.

4 CONCLUSION

The library described here provides the functionality needed to query the Ensembl database using the Ruby programming language. This API has several advantages compared to the Perl version, including a single API for all releases, terser code, a powerful interactive shell, a more useful implementation of the Slice concept and extensive introspection. The Ruby Ensembl API is also ideal for e.g. adding background information on candidate genes from the Ensembl database in applications geared at clinical geneticists (e.g. Annotate-It; Sifrim,A. et al., manuscript in preparation). From a library maintainer's perspective, the metaprogramming and introspection capabilities of the Ruby language and the ActiveRecord module allow for providing full functionality and easy maintenance with minimal effort. They enable the Ruby API to be very flexible and by definition virtually insensitive to adding or removing data columns in tables. The API is available through the Rubygem system and can be installed with the command gem install ruby-ensembl-api. Source code is at http://github.com/jandot/ruby-ensembl-api. Documentation and an extensive tutorial (with permission modified from the Perl API) is also available at the GitHub website.
  6 in total

1.  Biopython: freely available Python tools for computational molecular biology and bioinformatics.

Authors:  Peter J A Cock; Tiago Antao; Jeffrey T Chang; Brad A Chapman; Cymon J Cox; Andrew Dalke; Iddo Friedberg; Thomas Hamelryck; Frank Kauff; Bartek Wilczynski; Michiel J L de Hoon
Journal:  Bioinformatics       Date:  2009-03-20       Impact factor: 6.937

2.  Ensembl Genomes: extending Ensembl across the taxonomic space.

Authors:  P J Kersey; D Lawson; E Birney; P S Derwent; M Haimel; J Herrero; S Keenan; A Kerhornou; G Koscielny; A Kähäri; R J Kinsella; E Kulesha; U Maheswari; K Megy; M Nuhn; G Proctor; D Staines; F Valentin; A J Vilella; A Yates
Journal:  Nucleic Acids Res       Date:  2009-11-01       Impact factor: 16.971

3.  BioRuby: bioinformatics software for the Ruby programming language.

Authors:  Naohisa Goto; Pjotr Prins; Mitsuteru Nakao; Raoul Bonnal; Jan Aerts; Toshiaki Katayama
Journal:  Bioinformatics       Date:  2010-08-25       Impact factor: 6.937

4.  Ensembl 2011.

Authors:  Paul Flicek; M Ridwan Amode; Daniel Barrell; Kathryn Beal; Simon Brent; Yuan Chen; Peter Clapham; Guy Coates; Susan Fairley; Stephen Fitzgerald; Leo Gordon; Maurice Hendrix; Thibaut Hourlier; Nathan Johnson; Andreas Kähäri; Damian Keefe; Stephen Keenan; Rhoda Kinsella; Felix Kokocinski; Eugene Kulesha; Pontus Larsson; Ian Longden; William McLaren; Bert Overduin; Bethan Pritchard; Harpreet Singh Riat; Daniel Rios; Graham R S Ritchie; Magali Ruffier; Michael Schuster; Daniel Sobral; Giulietta Spudich; Y Amy Tang; Stephen Trevanion; Jana Vandrovcova; Albert J Vilella; Simon White; Steven P Wilder; Amonida Zadissa; Jorge Zamora; Bronwen L Aken; Ewan Birney; Fiona Cunningham; Ian Dunham; Richard Durbin; Xosé M Fernández-Suarez; Javier Herrero; Tim J P Hubbard; Anne Parker; Glenn Proctor; Jan Vogel; Stephen M J Searle
Journal:  Nucleic Acids Res       Date:  2010-11-02       Impact factor: 16.971

5.  BioMart Central Portal--unified access to biological data.

Authors:  Syed Haider; Benoit Ballester; Damian Smedley; Junjun Zhang; Peter Rice; Arek Kasprzyk
Journal:  Nucleic Acids Res       Date:  2009-05-06       Impact factor: 16.971

6.  An introduction to scripting in Ruby for biologists.

Authors:  Jan Aerts; Andy Law
Journal:  BMC Bioinformatics       Date:  2009-07-16       Impact factor: 3.169

  6 in total
  6 in total

1.  Annotate-it: a Swiss-knife approach to annotation, analysis and interpretation of single nucleotide variation in human disease.

Authors:  Alejandro Sifrim; Jeroen Kj Van Houdt; Leon-Charles Tranchevent; Beata Nowakowska; Ryo Sakai; Georgios A Pavlopoulos; Koen Devriendt; Joris R Vermeesch; Yves Moreau; Jan Aerts
Journal:  Genome Med       Date:  2012-09-26       Impact factor: 11.117

2.  Functional understanding of the diverse exon-intron structures of human GPCR genes.

Authors:  Dorothy A Hammond; Victor Olman; Ying Xu
Journal:  J Bioinform Comput Biol       Date:  2013-12-11       Impact factor: 1.122

3.  JEnsembl: a version-aware Java API to Ensembl data systems.

Authors:  Trevor Paterson; Andy Law
Journal:  Bioinformatics       Date:  2012-09-03       Impact factor: 6.937

4.  Comparative genomics of eukaryotic small nucleolar RNAs reveals deep evolutionary ancestry amidst ongoing intragenomic mobility.

Authors:  Marc P Hoeppner; Anthony M Poole
Journal:  BMC Evol Biol       Date:  2012-09-15       Impact factor: 3.260

5.  The Ruby UCSC API: accessing the UCSC genome database using Ruby.

Authors:  Hiroyuki Mishima; Jan Aerts; Toshiaki Katayama; Raoul J P Bonnal; Koh-ichiro Yoshiura
Journal:  BMC Bioinformatics       Date:  2012-09-21       Impact factor: 3.169

6.  AnnotateGenomicRegions: a web application.

Authors:  Luca Zammataro; Rita DeMolfetta; Gabriele Bucci; Arnaud Ceol; Heiko Muller
Journal:  BMC Bioinformatics       Date:  2014-01-10       Impact factor: 3.169

  6 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.