Literature DB >> 18689808

BioJava: an open-source framework for bioinformatics.

R C G Holland¹, T A Down, M Pocock, A Prlić, D Huen, K James, S Foisy, A Dräger, A Yates, M Heuer, M J Schreiber.

Abstract

SUMMARY: BioJava is a mature open-source project that provides a framework for processing of biological data. BioJava contains powerful analysis and statistical routines, tools for parsing common file formats and packages for manipulating sequences and 3D structures. It enables rapid bioinformatics application development in the Java programming language. AVAILABILITY: BioJava is an open-source project distributed under the Lesser GPL (LGPL). BioJava can be downloaded from the BioJava website (http://www.biojava.org). BioJava requires Java 1.5 or higher. All queries should be directed to the BioJava mailing lists. Details are available at http://biojava.org/wiki/BioJava:MailingLists.

Entities: Disease Species

Mesh：

Year: 2008 PMID： 18689808 PMCID： PMC2530884 DOI： 10.1093/bioinformatics/btn397

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 INTRODUCTION

BioJava was conceived in 1999 by Thomas Down and Matthew Pocock as an Application Programming Interface (API) to simplify bioinformatics software development using Java (Pocock, 2003; Pocock et al., 2000). It has since then evolved to become a fully featured framework with modules for performing many common bioinformatics tasks. The goal of BioJava is to facilitate code reuse and to provide standard implementations that are easy to link to external scripts and applications. BioJava is an open-source project that is developed by volunteers and coordinated by the Open Bioinformatics Foundation (OBF). It is one of several Bio* toolkits (Mangalam, 2002). All code is distributed under the LGPL license and can be freely used and reused in any form. BioJava is a mature project and has been employed in a number of real-world applications and over 50 published studies. A list of these can be found on the BioJava website. According to the project tracking web site Ohloh (http://www.ohloh.net/projects/biojava), the BioJava code-base represents an estimated 47 person-years worth of effort.

2 FEATURES

BioJava contains a number of mature APIs. The 10 most frequently used are: (1) nucleotide and amino acid alphabets, (2) BLAST parser, (3) sequence I/O, (4) dynamic programming, (5) structure I/O and manipulation, (6) sequence manipulation, (7) genetic algorithms, (8) statistical distributions, (9) graphical user interfaces and (10) serialization to databases. Below follows a short discussion of some of these modules. At the core of BioJava is a symbolic alphabet API which represents sequences as a list of references to singleton symbol objects that are derived from an alphabet. Lists of symbols are stored whenever possible in a compressed form of up to four symbols per byte of memory. In addition to the fundamental symbols of a given alphabet (A, C, G and T in the case of DNA), all BioJava alphabets implicitly contain extra symbol objects representing all possible combinations of the fundamental symbols. The symbol approach allows the construction of higher order alphabets and symbols that represent the multiplication of one or more alphabets. An example is the codon ‘alphabet’ which is the cubed product of the DNA alphabet, each codon ‘symbol’ comprising three DNA symbols. Such an alphabet allows construction of views over sequences without modifying the underlying sequence which is useful for tasks such as translation. Other complex alphabets which can be described include conditional alphabets for the construction of conditional probability distributions, and heterogeneous alphabets such as the combination of the codon and protein alphabets for use with a DNA–protein aligning hidden Markov model (HMM). Other interesting applications of the alphabet API include chromosomes for genetic algorithms using, but not limited to, integer or binary symbol lists, and the representation of Phred quality scores (Ewing et al., 1998) as a multiplication of the DNA and integer alphabets. The typical user would most likely start out by using the sequence input/output API and the sequence/feature object model. These allow sequences to be loaded from a number of common file formats such as FASTA, GenBank and EMBL, optionally manipulated in memory, then saved again or converted into a different format. The simplicity of this process is demonstrated in Figure 1.

Fig. 1.

Loading a GenBank file with BioJava and writing it out as FASTA. The example demonstrates the use of several convenience methods that hide the bulk of the implementation. If the developer desires a more flexible parser it is possible to make use of the interfaces hidden behind the convenience methods to expose a fully customizable, multi-component, event-based parsing model. Another useful API is the feature/annotation object model which associates sequences with located features and unlocated annotations. Features can be found either by keyword or by defining a location query from which all overlapping or contained features are returned, while annotations can be retrieved by keyword. The location model handles circular and stranded locations, split locations and multi-sequence locations allowing features to span complex sets of coordinates. The protein structure API contains tools for parsing and manipulating PDB files (Berman et al., 2000). It contains utility methods to perform linear algebra calculations on atomic coordinates and can calculate 3D structure alignments. A simple interface to the 3D visualization library Jmol (http://www.jmol.org) is contained as well. An add-on allows the serialization of the content of a PDB file to a database using Hibernate (http://www.hibernate.org). Other APIs include those for working with chromatograms, sequence alignments, proteomics and ontologies. Parsers are provided for reading, amongst others, Blast reports (Altschul et al., 1997), ABI chromatograms and NCBI taxonomy definitions. Recently the BioJavaX module was added which provides more detailed parsing of the common file formats and improved storing of sequence data into BioSQL databases (http://www.biosql.org). This allows to incorporate BioJava into existing data processing pipelines which use alternative OBF toolkits such as BioPerl (Stajich et al., 2002). The BioJava web site provides detailed manuals on how to use the different components. In particular, the ‘CookBook’ section provides a quick introduction into solving many problems by demonstrating solutions with documented source code. There is also a section to demonstrate the performance of a few selected tasks via Java WebStart examples. To mention just one: the FASTA-formatted release 4 Drosophila genome sequence can be parsed in <20 s on a 1.80 GHz Core Duo processor.

3 FUTURE DEVELOPMENT

BioJava aims to provide an API that is of use to anyone using Java to develop bioinformatics software, regardless of which specialization they may work in. Genomic features currently must be manipulated with reference to the underlying genomic sequence, which can make working with post-genomic datasets, such as microarray results, overly complex. Phylogenetics tools are already in development which will allow users to work with NEXUS tree files (Maddison et al., 1997). Although the Blast parsing API is widely used, it does not support all of the existing blast-family output formats. We will continue the ongoing effort to add parsers for PSI-Blast and other currently unsupported formats. Users are welcome to identify further areas of need and their suggestions will be incorporated into future developments. BioJava is written entirely in the Java programming language, and will run on any platform for which a Java 1.5 run-time environment is available. Java 5 and 6 provide advanced language features, and we shall be taking advantage of these in the next major release, both to aid in maintenance of the library and to make it even easier for novice Java developers to make use of the BioJava APIs.

4 CONCLUSIONS

BioJava is one of the largest open-source APIs for bioinformatics software development. It is a mature project with a large user and support community. It offers a wide range of tools for common bioinformatics tasks. The BioJava homepage provides access to the source code and detailed documentation.

6 in total

1. The Protein Data Bank.

Authors: H M Berman; J Westbrook; Z Feng; G Gilliland; T N Bhat; H Weissig; I N Shindyalov; P E Bourne
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. NEXUS: an extensible file format for systematic information.

Authors: D R Maddison; D L Swofford; W P Maddison
Journal: Syst Biol Date: 1997-12 Impact factor: 15.683

3. The Bioperl toolkit: Perl modules for the life sciences.

Authors: Jason E Stajich; David Block; Kris Boulez; Steven E Brenner; Stephen A Chervitz; Chris Dagdigian; Georg Fuellen; James G R Gilbert; Ian Korf; Hilmar Lapp; Heikki Lehväslaiho; Chad Matsalla; Chris J Mungall; Brian I Osborne; Matthew R Pocock; Peter Schattner; Martin Senger; Lincoln D Stein; Elia Stupka; Mark D Wilkinson; Ewan Birney
Journal: Genome Res Date: 2002-10 Impact factor: 9.043

4. The Bio* toolkits--a brief overview.

Authors: Harry Mangalam
Journal: Brief Bioinform Date: 2002-09 Impact factor: 11.622

5. Base-calling of automated sequencer traces using phred. I. Accuracy assessment.

Authors: B Ewing; L Hillier; M C Wendl; P Green
Journal: Genome Res Date: 1998-03 Impact factor: 9.043

Review 6. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Authors: S F Altschul; T L Madden; A A Schäffer; J Zhang; Z Zhang; W Miller; D J Lipman
Journal: Nucleic Acids Res Date: 1997-09-01 Impact factor: 16.971

6 in total

87 in total

1. Enhanced meta-analysis of acetylcholine binding protein structures reveals conformational signatures of agonism in nicotinic receptors.

Authors: Spencer T Stober; Cameron F Abrams
Journal: Protein Sci Date: 2012-03 Impact factor: 6.725

2. SWIFT MODELLER v2.0: a platform-independent GUI for homology modeling.

Authors: Abhinav Mathur; Ambarish S Vidyarthi
Journal: J Mol Model Date: 2011-12-09 Impact factor: 1.810

3. Evolutionary Ancestry of Eukaryotic Protein Kinases and Choline Kinases.

Authors: Shenshen Lai; Javad Safaei; Steven Pelech
Journal: J Biol Chem Date: 2016-01-07 Impact factor: 5.157

4. BlastGraph: a comparative genomics tool based on BLAST and graph algorithms.

Authors: Yanbo Ye; Bo Wei; Lei Wen; Simon Rayner
Journal: Bioinformatics Date: 2013-09-24 Impact factor: 6.937

5. A toolbox for developing bioinformatics software.

Authors: Kristian Rother; Wojciech Potrzebowski; Tomasz Puton; Magdalena Rother; Ewa Wywial; Janusz M Bujnicki
Journal: Brief Bioinform Date: 2011-07-29 Impact factor: 11.622

6. Reconstructing genome-scale metabolic models with merlin.

Authors: Oscar Dias; Miguel Rocha; Eugénio C Ferreira; Isabel Rocha
Journal: Nucleic Acids Res Date: 2015-04-06 Impact factor: 16.971

7. Biopython: freely available Python tools for computational molecular biology and bioinformatics.

Authors: Peter J A Cock; Tiago Antao; Jeffrey T Chang; Brad A Chapman; Cymon J Cox; Andrew Dalke; Iddo Friedberg; Thomas Hamelryck; Frank Kauff; Bartek Wilczynski; Michiel J L de Hoon
Journal: Bioinformatics Date: 2009-03-20 Impact factor: 6.937

8. Chem2Bio2RDF: a semantic framework for linking and data mining chemogenomic and systems chemical biology data.

Authors: Bin Chen; Xiao Dong; Dazhi Jiao; Huijun Wang; Qian Zhu; Ying Ding; David J Wild
Journal: BMC Bioinformatics Date: 2010-05-17 Impact factor: 3.169

9. Evidence of molecular evolution driven by recombination events influencing tropism in a novel human adenovirus that causes epidemic keratoconjunctivitis.

Authors: Michael P Walsh; Ashish Chintakuntlawar; Christopher M Robinson; Ijad Madisch; Balázs Harrach; Nolan R Hudson; David Schnurr; Albert Heim; James Chodosh; Donald Seto; Morris S Jones
Journal: PLoS One Date: 2009-06-03 Impact factor: 3.240

10. Identification of gene co-regulatory modules and associated cis-elements involved in degenerative heart disease.

Authors: Charles G Danko; Arkady M Pertsov
Journal: BMC Med Genomics Date: 2009-05-28 Impact factor: 3.063