
BioBIKE: a Web-based, programmable, integrated biological knowledge base.

Jeff Elhai, Arnaud Taton, J P Massar, John K Myers, Mike Travers, Johnny Casey, Mark Slupesky, Jeff Shrager.

Abstract

BioBIKE (biobike.csbc.vcu.edu) is a web-based environment enabling biologists with little programming expertise to combine tools, data, and knowledge in novel and possibly complex ways, as demanded by the biological problem at hand. BioBIKE is composed of three integrated components: a biological knowledge base, a graphical programming interface and an extensible set of tools. Each of the five current BioBIKE instances provides all available information (genomic, metabolic, experimental) appropriate to a given research community. The BioBIKE programming language and graphical programming interface employ familiar operations to help users combine functions and information to conduct biologically meaningful analyses. Many commonly used tools, such as Blast and PHYLIP, are built-in, allowing users to access them within the same interface and to pass results from one to another. Users may also invent their own tools, packaging complex expressions under a single name, which is immediately made accessible through the graphical interface. BioBIKE represents a partial solution to the difficult question of how to enable those with no background in computer programming to work directly and creatively with mass biological information. BioBIKE is distributed under the MIT Open Source license. A description of the underlying language and other technical matters is available at www.Biobike.org.


Year: 2009 PMID: 19433511 PMCID: PMC2703918 DOI: 10.1093/nar/gkp354

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Research in all areas of biology has come increasingly to rely upon massive sets of digital data and knowledge, the manipulation of which places most researchers outside their area of comfort. Despite a spectacular range of resources available to analyze biological information (witness this issue of NAR), biological problems still often require the development of novel methods. Existing tools may display results that are easy for humans to read, but they generally do not deliver them in a form that is useful for subsequent computations. Biologists without programming expertise (no doubt the majority) muddle through as best they can, using isolated tools and spreadsheets, or seeking the help of programmers. In the latter case, the resulting division of knowledge is far from ideal, obscuring the process from the biologist's view and making it difficult to understand the meaning of the results. Moreover, the biologist loses easy access to surprising intermediate results, which are at the heart of fundamental accidental discoveries (Elhai et al., manuscript submitted for publication). BioBIKE (the Biological Integrated Knowledge Environment; formerly BioLingua, 1) has been developed to allow researchers without programming expertise to combine tools, data and knowledge in ways demanded by the biological problem at hand. BioBIKE is composed of three integrated components: (i) a biological knowledge base, (ii) a graphical programming interface and (iii) an extensible set of tools that can be combined in novel ways.

BioBIKE INSTANCES AND THEIR KNOWLEDGE AND DATA BASES

A BioBIKE instance provides a framework for all available information needed by a given research community (Table 1), including sets of genomic sequences, gene annotations, functional descriptions, formal categories (e.g. COG), hierarchical groupings of metabolic reactions linked with genes (from KEGG, 2) and internal tables of Blast scores to support rapid protein comparisons. In addition, an instance may be stocked with experimental data, such as results from microarray or proteomic experiments. Indeed, any data that can be put into a standardized form, such as a table or XML structure, can be integrated into the knowledge-base (in simple cases through built-in resources, otherwise with the help of BioBIKE engineers). All of this knowledge and data are represented in an integrated manner within the BioBIKE frame system.

Table 1. Current BioBIKEs (a)

CyanoBIKE: Cyanobacteria (42 genomes)
ParaBIKE: Eukaryotic parasites (5 genomes)
StaphyloBIKE: Staphylococcus (45 genomes)
StreptoBIKE: Streptococcus (25 genomes)
ViroBIKE: Viruses (1797 genomes, 20 metagenomes)
BIKE: Used for education (0 genomes)

(a) All instances are available through biobike.csbc.vcu.edu

The availability of integrated data and knowledge on the same server makes possible certain operations that are not practical with data distributed across the web. For example, in CyanoBIKE it is a simple matter to find all proteins common to one set of organisms (perhaps user-defined) but absent from another, such as those found in N2-fixing cyanobacteria but not in non-N2-fixing cyanobacteria. Protein similarities and orthologs amongst proteins of organisms outside the database are also available using the same interface, albeit more slowly, through services such as NCBI's Blast (3).
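
The "common to one set of organisms but absent from another" query amounts to an intersection followed by a set difference. The following Python sketch illustrates only that logic; it is not BioBIKE code, and the organism names, family identifiers and the function name `common_to_but_absent_from` are hypothetical stand-ins for whatever ortholog groupings an instance actually holds.

```python
from typing import Dict, Iterable, Set

# Hypothetical toy data: organism name -> set of ortholog-family identifiers.
# None of these names or memberships come from CyanoBIKE.
FAMILIES_BY_ORGANISM: Dict[str, Set[str]] = {
    "fixer_A": {"fam1", "fam2", "fam3", "fam4"},
    "fixer_B": {"fam1", "fam2", "fam4", "fam5"},
    "nonfixer_C": {"fam1", "fam5", "fam6"},
    "nonfixer_D": {"fam1", "fam7"},
}


def common_to_but_absent_from(
    include: Iterable[str],
    exclude: Iterable[str],
    table: Dict[str, Set[str]] = FAMILIES_BY_ORGANISM,
) -> Set[str]:
    """Families present in every `include` organism and in no `exclude` organism."""
    include = list(include)
    if not include:
        return set()
    # Families shared by all organisms of the first group.
    shared = set.intersection(*(table[name] for name in include))
    # Families seen in at least one organism of the second group.
    seen_elsewhere = set().union(*(table[name] for name in exclude))
    return shared - seen_elsewhere


result = common_to_but_absent_from(
    ["fixer_A", "fixer_B"], ["nonfixer_C", "nonfixer_D"]
)
print(sorted(result))  # ['fam2', 'fam4']
```

With all data on one server, such a query needs only in-memory set operations; with data scattered across web resources, each membership test would instead require a remote lookup.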

THE BioBIKE GRAPHICAL PROGRAMMING INTERFACE

Checking the VPL (visual programming language) box at the login screen of a BioBIKE instance brings the user to the graphical programming interface. (Users may also access BioBIKE with scripts through a web-based command line interface described in ref. 1.) An example of the function palette and workspace is shown in Figure 1. BioBIKE functions and other constructs are represented by boxes obtained from pull down menus. These may be moved around by familiar actions such as drag-and-drop and copy-paste to form complex expressions. When completed, expressions may be executed by double-clicking them. Results are returned (and sometimes displayed in a human-readable format) so that the user can assess the effect of each step. Data, and whole sessions, may be saved to the BioBIKE server so that incomplete work can be continued later.

Figure 1.

BioBIKE function palette and workspace. The green workspace shows the work of a user looking for a regulatory sequence upstream from a gene, by focusing on sequences common amongst upstream sequences of orthologous genes in related organisms. The first function defines the variable gln-orthologs as the set of orthologs in marine cyanobacteria of a gene the user knows to encode glutamine synthetase. The second function is in the midst of being completed. The user is choosing the newly defined variable from the VARIABLES menu to be inserted into a function that will extract the sequences upstream from all the orthologs and then find statistically overrepresented sequences within the set of sequences, using MEME (6).

The design of the BioBIKE language adheres to these principles:

Intelligibility: an expression should be intelligible to someone with requisite biological knowledge but no prior experience with BioBIKE. Many concepts of molecular biology, such as codon and ortholog, are incorporated into the language.

Computability of results and nesting: BioBIKE functions often display results formatted for human comprehension. In addition to this, functions generally return their results in a form that can serve as input for further analysis. This allows users to compose expressions by taking the result of one function and feeding it into another, producing new results at each turn. This process can be abbreviated by nesting expressions together, as shown in Figure 2.

Figure 2.

Example of a nested function. The function makes an alignment of the sequences of all orthologs of the protein Asr1156, starting as many as 100 amino acids before the nominal beginning of the protein but going backwards only up to the first stop codon. The sequences are labeled with the name of the protein, aligned using Clustal (5), and visualized using JalView (19). This is the code used to generate an alignment (discussed in Elhai, Taton, Massar, and Shrager, manuscript submitted for publication) that provides evidence against existing annotations of a family of conserved genes and for the use of nonstandard start codons in cyanobacteria.

Small working vocabulary: expressions that are related to each other have been brought together within a single function, to reduce the burden on the memory of a new user. For example, the function SEQUENCE-SIMILAR-TO performs all flavors of Blast, or finds sequences differing from a reference by a given number of mismatches, depending on options specified by the user.

Implied iteration: the size of biological databases often makes it necessary to perform iterative operations (i.e. loops). Such operations in conventional languages are the bane of those new to programming. Most BioBIKE functions iterate automatically. In Figure 1, for example, a specific gene could be given to the ORTHOLOG-OF function or a list of genes could be given instead. In the latter case, the function returns a list of results, one for each gene.

Extensibility: users can define new data and functions which immediately enter the language, becoming instantly accessible through the same sort of menus as built-in objects. In addition to serving as a memory aid, this affords the modular addition of concepts into the language itself. Users not satisfied with the names of concepts built into the language can readily build a private vocabulary if desired.

Although specialized for bioinformatics, BioBIKE is built on top of the standard computer language Lisp, and is therefore capable of all operations typical of a general purpose programming language. Behind the scenes, BioBIKE expressions are translated into Lisp and compiled, yielding code that runs at a speed comparable to that of C code. Lisp is a uniquely powerful language, often used to create new specialized languages, as we have done here. R, for example, borrows much of its underlying semantics from Scheme, a dialect of Lisp (4). Lisp is also the language of choice for artificial intelligence, which continues to inform BioBIKE's development.
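
The implied-iteration and nesting principles described above can be illustrated outside BioBIKE. The sketch below is written in Python rather than in BioBIKE's own language or in Lisp, and it is only an analogy under stated assumptions: the decorator `implied_iteration` and the functions `reverse_complement` and `gc_fraction` are hypothetical names chosen for this example, not BioBIKE functions.

```python
import functools
from typing import Any, Callable


def implied_iteration(func: Callable[..., Any]) -> Callable[..., Any]:
    """Let a one-item function also accept a list, returning one result per item."""

    @functools.wraps(func)
    def wrapper(arg: Any, *rest: Any, **options: Any) -> Any:
        if isinstance(arg, (list, tuple)):
            # A list was given: apply the function to each element in turn.
            return [func(item, *rest, **options) for item in arg]
        # A single item was given: call the function once.
        return func(arg, *rest, **options)

    return wrapper


_COMPLEMENT = str.maketrans("ACGT", "TGCA")


@implied_iteration
def reverse_complement(sequence: str) -> str:
    """Reverse complement of an upper-case DNA sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]


@implied_iteration
def gc_fraction(sequence: str) -> float:
    """Fraction of G and C bases in a DNA sequence (0.0 for an empty one)."""
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


# One item in, one result out.
print(gc_fraction("GGCC"))  # 1.0
# A list in, a list of results out, with no explicit loop written by the user.
print(gc_fraction(["AT", "GC", "ATGC"]))  # [0.0, 1.0, 0.5]
# Nesting: the list returned by one function feeds directly into the next.
print(gc_fraction(reverse_complement(["AAGC", "TTTT"])))  # [0.5, 0.0]
```

Because each function returns plain values or lists rather than formatted text, results can be nested freely, which is the property the computability principle asks for.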

BioBIKE TOOLSET AND ADVANCED FACILITIES

BioBIKE provides access to several commonly used programs: Blast (3), for sequence searches; Clustal (5), for multiple sequence alignments; MEME (6), for motif discovery; RNAz (7), for discovery of conserved RNA sequences; and PHYLIP (8), for construction of phylogenetic trees. All are accessed through the same interface, greatly reducing the need to figure out the idiosyncrasies of each resource. Useful tools not already in the language that have Application Programming Interfaces (APIs), or that are capable of running within a Linux environment, can generally be added to BioBIKE on request with little difficulty, and thus be made accessible to BioBIKE users through the standard graphical programming interface.

LEARNING BioBIKE AND STYLES OF BioBIKE USAGE

Online tours of BioBIKE are accessible through the BioBIKE portal (biobike.csbc.vcu.edu), and a tour of the resources of the interface and the basic conventions of the language is available through the HELP button. A tour that describes how BioBIKE can be used in motif discovery is included in the Supplementary Material. BioBIKE expressions are often intelligible when read, but new users do not find them easy to write. Those new to BioBIKE often begin by using it as a simple query language, asking, for example: ‘What is the sequence of my favorite gene?’ From there, one might construct a progressive series of queries, each one building on the result of the previous, for example: ‘What are the orthologs of my favorite gene?’ ‘What are the upstream sequences of those orthologs?’ ‘What common sequence motifs are found in those upstream sequences?’ This progression of questions might have led to Figure 1. This progressive evaluation style is critical for programming novices (9), and one may continue indefinitely within this style, obtaining useful results. However, it is also possible to create more complex structures from simple elements, facilitated by drag-and-drop and copy-paste operations. Figure 3 provides an example of iteration mixed in with sequential evaluation. Since each simple element may be executed independently by double-clicking on it, users may still examine the intermediate results, even within complex expressions.

Figure 3.

Example of progressive evaluation and iteration in BioBIKE. The pattern of four cysteine residues separated by 2, 2 and 3 amino acids is often found in proteins with 4Fe-4S clusters (20). (A) The first function finds the pattern of cysteines amongst the sequences of all proteins in the cyanobacterium Synechocystis PCC 6803 and assigns the names of the proteins bearing the motif (result #1) to a user-defined variable called 4fe-4s-proteins. (B) The annotation for each of the proteins is displayed in a separate window (see inset), and the annotations are also returned as result #2. (C) The user is concerned that this motif might well arise by chance on some proteins of Synechocystis. To test this, a set of random protein sequences is generated, each element being a random shuffling of a real protein sequence. The set is assigned to the variable random-sequences, and the random sequences are returned as result #3. (D) This set of random sequences is searched for the characteristic motif, and none are found (no result), lending some confidence to the belief that the presence of the motif in proteins of Synechocystis is of biological significance.

Complex expressions or sub-expressions can also be collapsed visually into single boxes, making it easier to grasp the larger picture. Moreover, as mentioned above, BioBIKE itself is extensible: if a user should devise a complex expression that might be of continued utility, the expression can be packaged, given a unique name, and made accessible via a menu, no differently from any other BioBIKE function. In this way, complicated operations can be broken up into logical chunks and subsequently offered as distinct functions.
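
The analysis in Figure 3 (search for a motif, then repeat the search on shuffled sequences as a control) can be sketched in a conventional language for comparison. The Python below is an illustration only, not the BioBIKE expression shown in the figure. The regular expression is one reading of "four cysteines separated by 2, 2 and 3 amino acids", and the two toy sequences, the dictionary `TOY_PROTEINS` and all function names are invented for this example rather than taken from Synechocystis.

```python
import random
import re
from typing import Dict, List

# Four cysteines separated by 2, 2 and 3 residues of any kind.
CYS_MOTIF = re.compile(r"C.{2}C.{2}C.{3}C")

# Hypothetical toy sequences, not real Synechocystis proteins.
TOY_PROTEINS: Dict[str, str] = {
    "toy_ferredoxin_like": "MAHCAGCKACEPVCPVNAI",
    "toy_other": "MSTNPKPQRKTKRNTNRRPQDVK",
}


def proteins_with_motif(proteins: Dict[str, str]) -> List[str]:
    """Names of the proteins whose sequence contains the cysteine motif."""
    return [name for name, seq in proteins.items() if CYS_MOTIF.search(seq)]


def shuffled_copy(sequence: str, rng: random.Random) -> str:
    """Random permutation of a sequence, preserving its amino acid composition."""
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


def shuffled_control(proteins: Dict[str, str], seed: int = 0) -> Dict[str, str]:
    """One shuffled counterpart per real protein, for use as a chance control."""
    rng = random.Random(seed)
    return {name: shuffled_copy(seq, rng) for name, seq in proteins.items()}


# Step A: find proteins bearing the motif.
hits = proteins_with_motif(TOY_PROTEINS)
print(hits)  # ['toy_ferredoxin_like']

# Steps C and D: build the shuffled set and search it the same way.
control_hits = proteins_with_motif(shuffled_control(TOY_PROTEINS))
print(len(control_hits), "shuffled sequences carry the motif")
```

Shuffling each real sequence keeps length and composition fixed, so a motif that is frequent in the real set but rare in the shuffled set is unlikely to be explained by composition alone. On a two-sequence toy set the control count carries no statistical weight; the comparison becomes meaningful only over a whole proteome.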

CONCLUSION

BioBIKE represents a novel paradigm regarding the interaction of biologists with information of interest to them. Its goal is to put the analysis of large amounts of information directly into the hands of biologists themselves, enabling them to manipulate biological knowledge and data in an interactive computational environment. This offers extraordinary power to biologists with little computational background. BioBIKE has already made possible a deep analysis of proteomic data (10), a cross-genomic analysis of repeated sequences (11), and the introduction of many dozens of undergraduates and high school students to biological analysis on the computer (Elhai, unpublished results).

Some excellent web-based resources, such as Entrez (12), provide convenient access to sequences and other information. BioBIKE does the same, but the information is returned in a form that may be used immediately for further analysis. Still other resources, such as IMG (13) and the NCBI implementation of Blast (14), provide a good interface for the analysis of sequences with a fixed set of tools. Some, e.g. Taverna (15) and Galaxy (galaxy.psu.edu), go a step further and facilitate the creation of a workflow using a fixed set of tools (fixed, that is, for those unfamiliar with computer programming). BioBIKE does these things as well, but does not confine the user to predetermined channels. The user new to programming may use existing tools or combine basic functions to create ways to answer questions for which tools do not exist. Such flexibility has previously required one to employ conventional programming languages, sometimes supplemented with bioinformatic add-ons, such as BioRuby (bioruby.org) or BioPerl (16). BioBIKE is equally powerful but does not require the user to learn the underlying language.

BioBIKE is aimed at biologists who do not wish to expend the effort required to learn a conventional programming language but who wish to have the same hands-on relationship with informational objects of study as they do with objects in the laboratory. BioBIKE is a first step in a new direction, and although even in its present state it is a powerful tool, it must be stressed that the goal of intuitive use remains unmet. Users should not expect to figure out BioBIKE as they would a simple web-based tool that offers a small number of functions. A greater set of tours and help pages may increase the ability of naïve users to exploit the resource independently. However, our current direction is more ambitious. We plan to extend BioDeducta (17), which combines BioBIKE with an automated reasoning system, to enable users to present BioBIKE with a natural language question (e.g. ‘Is there a common sequence motif found upstream of orthologs of glnA?’), and through a series of natural language interactions, arrive at a BioBIKE expression that answers the question.

SOFTWARE AVAILABILITY AND COMPATIBILITY

At the time of writing, there are five BioBIKE instances freely available through the web (Table 1). BioBIKE is written in Common Lisp, operating within the KnowOS paradigm (18), and is distributed under the MIT Open Source license. Although it is freely available for anyone to download and install (see www.BioBIKE.org for instructions), we encourage users to use already-existing servers. The authors are happy to discuss collaboration with communities of biologists who would like to create BioBIKE instances particular to sets of model organisms. At present, the graphical interface is only operational within Firefox 1.5 and above.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

National Science Foundation (DBI-0516378, DBI-0850146 to J.E.); the National Aeronautics and Space Administration (JRIs: NCC2-5555, NCC2-5462, NCC2-5471 to J.S.); software grants from Franz, Inc. and LispWorks, Inc. Funding for open access charge: Jeff Shrager.

Conflict of interest statement. None declared.

16 in total

1. The Bioperl toolkit: Perl modules for the life sciences.

Authors: Jason E Stajich; David Block; Kris Boulez; Steven E Brenner; Stephen A Chervitz; Chris Dagdigian; Georg Fuellen; James G R Gilbert; Ian Korf; Hilmar Lapp; Heikki Lehväslaiho; Chad Matsalla; Chris J Mungall; Brian I Osborne; Matthew R Pocock; Peter Schattner; Martin Senger; Lincoln D Stein; Elia Stupka; Mark D Wilkinson; Ewan Birney
Journal: Genome Res Date: 2002-10 Impact factor: 9.043

2. BioLingua: a programmable knowledge environment for biologists.

Authors: J P Massar; Michael Travers; Jeff Elhai; Jeff Shrager
Journal: Bioinformatics Date: 2004-08-12 Impact factor: 6.937

3. Quantitative overview of N2 fixation in Nostoc punctiforme ATCC 29133 through cellular enrichments and iTRAQ shotgun proteomics.

Authors: Saw Yen Ow; Josselin Noirel; Tanal Cardona; Arnaud Taton; Peter Lindblad; Karin Stensjö; Philip C Wright
Journal: J Proteome Res Date: 2009-01 Impact factor: 4.466

4. Very small mobile repeated elements in cyanobacterial genomes.

Authors: Jeff Elhai; Michiko Kato; Sarah Cousins; Peter Lindblad; José Luis Costa
Journal: Genome Res Date: 2008-07-03 Impact factor: 9.043

Review 5. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Authors: S F Altschul; T L Madden; A A Schäffer; J Zhang; Z Zhang; W Miller; D J Lipman
Journal: Nucleic Acids Res Date: 1997-09-01 Impact factor: 16.971

6. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice.

Authors: J D Thompson; D G Higgins; T J Gibson
Journal: Nucleic Acids Res Date: 1994-11-11 Impact factor: 16.971

7. Entrez Gene: gene-centered information at NCBI.

Authors: Donna Maglott; Jim Ostell; Kim D Pruitt; Tatiana Tatusova
Journal: Nucleic Acids Res Date: 2006-12-05 Impact factor: 16.971

8. Deductive biocomputing.

Authors: Jeff Shrager; Richard Waldinger; Mark Stickel; J P Massar
Journal: PLoS One Date: 2007-04-04 Impact factor: 3.240

9. The integrated microbial genomes (IMG) system in 2007: data content and analysis tool extensions.

Authors: Victor M Markowitz; Ernest Szeto; Krishna Palaniappan; Yuri Grechkin; Ken Chu; I-Min A Chen; Inna Dubchak; Iain Anderson; Athanasios Lykidis; Konstantinos Mavromatis; Natalia N Ivanova; Nikos C Kyrpides
Journal: Nucleic Acids Res Date: 2007-10-12 Impact factor: 16.971

10. NCBI BLAST: a better web interface.

Authors: Mark Johnson; Irena Zaretskaya; Yan Raytselis; Yuri Merezhuk; Scott McGinnis; Thomas L Madden
Journal: Nucleic Acids Res Date: 2008-04-24 Impact factor: 16.971
