Philipp N Seibel, Jan Krüger, Sven Hartmeier, Knut Schwarzer, Kai Löwenthal, Henning Mersch, Thomas Dandekar, Robert Giegerich.
Abstract
BACKGROUND: Today, there is a growing need in bioinformatics to combine available software tools into chains, thus building complex applications from existing single-task tools. To create such workflows, the tools involved have to be able to work with each other's data; therefore, a common set of well-defined data formats is needed. Unfortunately, current bioinformatic tools use a great variety of heterogeneous formats.
Year: 2006 PMID: 17087823 PMCID: PMC2001303 DOI: 10.1186/1471-2105-7-490
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Commonly used XML formats and their features
| Format | Scope | Advantages | Disadvantages |
| --- | --- | --- | --- |
| AGAVE | sequence/annotation | XML schema available, stable, format is open and seems to be actively maintained, well documented | XML schema is in BETA status (since Feb. 2003), XML schema defines no namespace, no restriction of sequence data |
| BioML | sequence/annotation | - | no XML schema available (DTD only), unclear if it is stable and maintained (last modified 1999) |
| BioSeq | plugin of readseq | - | no XML schema available (DTD can be generated), maintenance and stability unclear, undocumented |
| BSML | sequence/annotation, sequence alignments | well documented | no XML schema available (DTD only), unclear if it is maintained any longer (last updated 2002) |
| chadoXML | database format | - | no XML schema available (DTD can be generated), part of the GMOD XORT software package, undocumented |
| EMBLxml | sequence database format | XML schema available | XML schema defines no namespace, no restriction on content elements |
| GAMEXML | sequence/annotation | used in different OS projects, seems to be stable | no XML schema available (DTD only), maintenance unclear |
| INSDseq | sequence database format | lightweight | no XML schema available (DTD only) |
| MSAML | sequence alignments | - | no XML schema available, project page unreachable (DTD on third-party page), maintenance unclear |
| RNAML | RNA sequence, structure and experimental data | XML schema available, well documented | XML schema defines no namespace, complex and unmanageable, license and maintenance unclear (last modified 2002) |
| TinySeq | sequence data | stable, active, lightweight | no XML schema available (DTD only), undocumented |
The list above contains a summary evaluation of formats with the same scope of application as the HOBIT formats. A more complete list (including detailed features) is available at [57].
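Several entries in the list above are marked down because their XML schema defines no namespace. As a minimal sketch of how that property could be checked, the Python snippet below reads an XSD document and reports the `targetNamespace` attribute of its root `xs:schema` element, if present. The function name and the two inline toy schemas are illustrative assumptions; they are not taken from any of the listed formats.

```python
import xml.etree.ElementTree as ET

# Namespace URI of the W3C XML Schema language itself.
XSD_NS = "http://www.w3.org/2001/XMLSchema"


def target_namespace(xsd_text):
    """Return the targetNamespace of an XSD document, or None if it has none."""
    root = ET.fromstring(xsd_text)
    if root.tag != f"{{{XSD_NS}}}schema":
        raise ValueError("root element is not xs:schema")
    return root.get("targetNamespace")


# Hypothetical toy schemas, for illustration only.
WITH_NS = (
    '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" '
    'targetNamespace="http://example.org/seq"/>'
)
WITHOUT_NS = '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"/>'

print(target_namespace(WITH_NS))
print(target_namespace(WITHOUT_NS))
```

A schema without a target namespace places its elements in no namespace at all, which makes it harder to combine documents of that format with other XML vocabularies.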
Comparison of native formats and their HOBIT XML counterparts
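| Native format | HOBIT XML format | Content |
| --- | --- | --- |
| FASTA | SequenceML | simple sequence information for nucleic and amino acids |
| GCG | SequenceAnnotationML | sequence information with additional facilities for annotations |
| STADEN | SequenceAnnotationML | sequence information with additional facilities for annotations |
| FASTA | AlignmentML | (multiple) alignments for nucleic and amino acids |
| CLUSTAL | AlignmentML | (multiple) alignments for nucleic and amino acids |
| MSF | AlignmentML | (multiple) alignments for nucleic and amino acids |
| mFOLD | RNAStructML | RNA secondary structure information |
| Vienna style DotBracket | RNAStructML | RNA secondary structure information |
| aligned Vienna style DotBracket | RNAStructAlignmentML | (multiple) alignments of RNA secondary structures |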
The table shows a comparison of some native bioinformatic file formats (first column) and their HOBIT XML counterparts (second column). These XML formats cover sequence, alignment, RNA secondary structure and RNA secondary structure alignment formats in a form that is independent of any specific program. The usage of the XML formats leads to a significant decrease in the number of necessary file formats.
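To illustrate what one of the native formats in the first column carries, here is a minimal Python sketch that parses a Vienna-style dot-bracket string into an explicit list of base pairs. This is the kind of information a structured format such as RNAStructML has to represent explicitly. The function name is hypothetical, pseudoknot bracket types are not handled, and the code is not part of BioDOM.

```python
def dotbracket_to_pairs(structure):
    """Convert a dot-bracket string into a sorted list of (i, j) base pairs.

    Positions are 0-based; '(' opens a pair, ')' closes the most recently
    opened one, and '.' marks an unpaired base.
    """
    stack = []
    pairs = []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            pairs.append((stack.pop(), i))
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r} at position {i}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


print(dotbracket_to_pairs("((..))"))
```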
Figure 1. Basic concept of HOBIT XML schemas. The basic concept of HOBIT XML schemas is explained step by step using SequenceML as an example. First, an amino acid sequence with id and description in the well-known FASTA format is converted to SequenceML. The color coding highlights the transformed content. In SequenceML it is possible to differentiate between various sequence types (in this case an amino acid sequence), defined by the SequenceML schema. The SequenceML schema derives its basic type information from BioTypes.
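The conversion described in this caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the element and attribute names (`sequenceML`, `sequence`, `seqID`, `description`, `aminoAcidSequence`, `nucleicAcidSequence`) are hypothetical stand-ins and may differ from the real SequenceML schema, and the sequence-type detection is a naive alphabet check rather than the typing provided by BioTypes.

```python
import xml.etree.ElementTree as ET


def parse_fasta(text):
    """Yield (id, description, sequence) tuples from FASTA-formatted text."""
    seq_id, desc, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, desc, "".join(chunks)
            header = line[1:].split(None, 1)
            seq_id = header[0] if header else ""
            desc = header[1] if len(header) > 1 else ""
            chunks = []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, desc, "".join(chunks)


def guess_type(seq):
    """Naive heuristic (assumption): only ACGTUN letters means nucleic acid."""
    if set(seq.upper()) <= set("ACGTUN"):
        return "nucleicAcidSequence"
    return "aminoAcidSequence"


def fasta_to_xml(text):
    """Build a SequenceML-like XML string (hypothetical element names)."""
    root = ET.Element("sequenceML")
    for seq_id, desc, seq in parse_fasta(text):
        entry = ET.SubElement(root, "sequence", {"seqID": seq_id})
        ET.SubElement(entry, "description").text = desc
        ET.SubElement(entry, guess_type(seq)).text = seq
    return ET.tostring(root, encoding="unicode")


print(fasta_to_xml(">sp1 test protein\nMKV\nLLA\n"))
```

Keeping the sequence type in the element name, rather than leaving it implicit as FASTA does, is what allows a schema to restrict the permitted characters for each type.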
Figure 3. Example of BioDOM integration. This code sample shows how to use the SequenceML API of BioDOM. First, an empty SequenceML object is generated (line 12); afterwards, a FASTA-formatted file is appended (line 14). Some of the possibilities for further processing, as shown in the comments, are given in lines 16 to 20.
Available webservices supporting HOBIT XML formats
| Webservice | Description | Input/output format |
| --- | --- | --- |
| CLUSTALW [18] | general purpose multiple sequence alignment program | SequenceML/AlignmentML |
| DCA [37] | divide-and-conquer multiple sequence alignment program | SequenceML/AlignmentML |
| Dialign [38] | computation of an alignment based on segment-to-segment comparison | SequenceML/AlignmentML |
| E2G [52] | aligning genomic sequence to cDNA and EST data sets | SequenceML/EBIApplicationResult |
| ITS2 [58-60] | database of more than 20,000 rRNA internal transcribed spacer 2 (ITS2) secondary structures revealed by homology modeling | -/RNAStructML |
| pknotsRG [61] | RNA secondary structure folding, including the class of simple recursive pseudoknots | SequenceML/RNAStructML |
| realsplice | information about regulated alternatively spliced genes | -/SequenceAnnotationML |
| REPuter [62] | computes all maximal duplications and reverse, complemented and reverse complemented repeats in a nucleic acid sequence | SequenceML/EBIApplicationResult |
| RNAshapes [39] | computation of a small set of representative structures of different shapes, computation of shape probabilities, and comparative prediction of consensus structures | SequenceML/RNAStructML |
| RNAfold [19] | RNA secondary structure folding | SequenceML/RNAStructML |
Figure 2. Graph representation of the e2g workflow. The e2g webservice gets sequence information in SequenceML format (besides a couple of parameters) as input and returns the result data as an EBIApplicationResult XML document. The input data can originate from a file (containing sequence information as SequenceML) or from an external data source like the SOAPDB webservice (which returns sequence information in FASTA format). Using BioDOM as a converter between the different data formats, it is quite easy to add another data source. The e2g webservice is a workflow itself and also uses webservice technology to mask repeats (using the RepeatMasker webservice) and match the input sequence data against huge EST databases (using the vmatch webservice). The match result is filtered (depending on input parameters) and returned as an EBIApplicationResult document.
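The shape of such a workflow (source adapter, masking, matching, filtering) can be sketched as a chain of plain Python functions. Everything below is a hypothetical stand-in: no webservice is called, the masking and matching steps are toy placeholders rather than RepeatMasker or vmatch, and the record layout and the in-memory `EST_DB` are assumptions made for the example.

```python
def fasta_source(text):
    """Adapter: parse FASTA text into a list of {'id', 'seq'} records."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            fields = line[1:].split()
            records.append({"id": fields[0] if fields else "", "seq": ""})
        elif line and records:
            records[-1]["seq"] += line
    return records


def mask_step(records):
    """Placeholder for repeat masking: lower-case residues become 'N'."""
    return [
        {**r, "seq": "".join("N" if c.islower() else c for c in r["seq"])}
        for r in records
    ]


def make_match_step(database):
    """Placeholder for database matching: exact substring search."""
    def match_step(records):
        hits = []
        for r in records:
            for est_id, est_seq in database.items():
                if est_seq in r["seq"]:
                    hits.append(
                        {"query": r["id"], "hit": est_id, "length": len(est_seq)}
                    )
        return hits
    return match_step


def make_filter_step(min_length):
    """Keep only hits of at least min_length (stands in for result filtering)."""
    def filter_step(hits):
        return [h for h in hits if h["length"] >= min_length]
    return filter_step


def run_workflow(data, steps):
    """Feed the output of each step into the next one."""
    for step in steps:
        data = step(data)
    return data


# Toy in-memory "EST database" (assumption, for illustration only).
EST_DB = {"est1": "ACGTAC", "est2": "GG"}

result = run_workflow(
    fasta_source(">g1 demo\nTTACGTACaaGGTT\n"),
    [mask_step, make_match_step(EST_DB), make_filter_step(4)],
)
print(result)
```

Because every step consumes and produces the same kind of record list, adding another data source only requires one more adapter function that returns records in this layout, which mirrors the role the caption assigns to a format converter.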
WSDL location of SOAP based webservices
| CLUSTALW | |
| DCA | |
| Dialign | |
| E2G | |
| ITS2 | |
| pknotsRG | |
| realsplice | |
| REPuter | |
| RNAshapes | |
| RNAfold | |
This table shows the currently available bioinformatic webservices for a large variety of tasks that support the HOBIT XML data formats as input and output.