James Christopher Bare, Nitin S Baliga.
Abstract
Understanding biological complexity demands a combination of high-throughput data and interdisciplinary skills. One way to bring to bear the necessary combination of data types and expertise is by encapsulating domain knowledge in software and composing that software to create a customized data analysis environment. To this end, simple, flexible strategies are needed for interconnecting heterogeneous software tools and enabling data exchange between them. Drawing on our own work and that of others, we present several strategies for interoperability and their consequences, in particular a set of simple data structures (list, matrix, network, table and tuple) that have proven sufficient to achieve a high degree of interoperability. We provide a few guidelines for the development of future software that will function as part of an interoperable community of software tools for biological data analysis and visualization.
Keywords: bioinformatics; data analysis; integration; interoperability; software engineering; systems biology
Year: 2012 PMID: 23235920 PMCID: PMC4103535 DOI: 10.1093/bib/bbs074
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622
Figure 1: A biological data analysis workflow to cluster and characterize gene expression data. A gene expression matrix derived from microarray or sequencing experiments is clustered (here using the data exploration tool MeV), producing lists of co-expressed genes, which are then passed to two web resources for further analysis. KEGG maps gene lists to relevant metabolic pathways; DAVID computes functional enrichment.
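The data handoff in this workflow (matrix in, gene lists out, lists passed onward for annotation) can be sketched in Python. The `cluster_genes` function below is a toy stand-in, not MeV's actual algorithm; it exists only to make the shapes of the exchanged data concrete.

```python
# Toy sketch of the Figure 1 data flow: a gene expression matrix is
# partitioned into lists of genes, which downstream tools (KEGG, DAVID)
# would consume for pathway mapping and functional enrichment.
# This is NOT MeV's clustering; it is an illustrative stand-in.

def cluster_genes(matrix, gene_ids):
    """Split genes into two lists by the sign of their mean expression.

    matrix   -- list of rows, one row of numeric values per gene
    gene_ids -- list of gene identifiers, parallel to the rows
    Returns a dict of gene lists keyed by cluster name.
    """
    up, down = [], []
    for gene, row in zip(gene_ids, matrix):
        (up if sum(row) >= 0 else down).append(gene)
    return {"up": up, "down": down}
```

Each resulting gene list would then be submitted, as plain identifiers, to a resource such as KEGG or DAVID; the list is the interchange format, so neither side needs to know how the other works internally.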
Strategies for interoperability
| Strategy | Description |
| --- | --- |
| Adapter | A component that translates between incompatible interfaces, protocols or content. |
| API | Application programming interface; functionality exposed for use by external components. |
| Broker (mediator or arbitrator) | An intermediary that coordinates interaction between components, serving as the hub in a hub-and-spoke architecture. |
| Message passing | Sending data from one process to one or more independent processes. |
| Plug-in architecture | Run-time integration of separately developed task-specific functionality into a general-purpose host program. |
| RPC | Remote procedure call; a style of interaction characterized by synchronous invocation of specific functionality running in another process. |
| Shared representation | A commonly understood data format accessed by multiple programs; for example, a shared DB, a common file format (FASTA, GFF, SAM & BAM, RDF and ontologies). A message payload or arguments to an API call can also serve as a shared representation. |
| Streaming | Processing partial data as it arrives without waiting for a complete transmission. |
| Web services | An API made available over web protocols (HTTP). SOAP and REST are two common styles. |
| Workflow | A repeatable pattern of data processing and transformation designed by arranging separate software components to carry out distinct steps. |
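To make the shared-representation strategy from the table concrete, the following minimal Python sketch exchanges sequence records through the FASTA format named above. The function names are our own, and in practice an established library such as Biopython would be preferred over a hand-rolled parser.

```python
# A minimal sketch of the "shared representation" strategy: two tools that
# know nothing about each other's internals can still exchange data through
# a commonly understood file format, here FASTA.

def write_fasta(records, path):
    """Write (identifier, sequence) pairs to a file in FASTA format."""
    with open(path, "w") as out:
        for name, seq in records:
            out.write(f">{name}\n{seq}\n")

def read_fasta(path):
    """Parse a FASTA file back into a list of (identifier, sequence) pairs."""
    records, name, chunks = [], None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    records.append((name, "".join(chunks)))
                name, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records
```

Because the format, not either program, is the point of agreement, a producer and consumer written in different languages interoperate without any direct coupling.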
Figure 2: The shapes of scientific data. A wide variety of scientific data can be represented by a handful of fundamental data structures. A list might hold protein or gene identifiers. Networks represent regulatory influence, metabolic pathways or protein interactions. Numeric data resides in matrices, for example a gene expression matrix or promoter motif PSSM. The combination of tabular data and matrices could enable ChIP-chip data, tiling array data and genome features to be plotted by location in the genome. A bicluster, a set of genes co-expressed under specific conditions, might be represented by the combination of a list of genes, a list of conditions and a gene expression matrix, tied together in a tuple (hierarchically nested key-value pairs). Tuples may also represent experimental design (metadata about media, environmental variables or patient data).
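The bicluster example from this caption, two lists and a matrix tied together in a tuple of nested key-value pairs, can be sketched directly in Python. The field names and gene identifiers below are illustrative only, not a fixed schema.

```python
# Sketch of Figure 2's composite example: a bicluster as hierarchically
# nested key-value pairs tying together two lists and a matrix.
# Field names and identifiers are illustrative, not a standard schema.

bicluster = {
    "genes": ["geneA", "geneB", "geneC"],       # list of gene identifiers
    "conditions": ["low_oxygen", "high_salt"],  # list of condition names
    "expression": [                             # matrix: genes x conditions
        [1.8, -0.4],
        [2.1, -0.7],
        [1.5, -0.2],
    ],
}

def mean_expression(bc, condition):
    """Average expression of the bicluster's genes under one condition."""
    j = bc["conditions"].index(condition)
    column = [row[j] for row in bc["expression"]]
    return sum(column) / len(column)
```

A tool that understands this handful of shapes (list, matrix, tuple) can consume the bicluster without any knowledge of the algorithm that produced it, which is the point of standardizing on a few fundamental data structures.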