MAGETabulator, a suite of tools to support the microarray data format MAGE-TAB.

Tim F Rayner, Faisal Ibne Rezwan, Margus Lukk, Xiangqun Zheng Bradley, Anna Farne, Ele Holloway, James Malone, Eleanor Williams, Helen Parkinson.

Abstract

SUMMARY: The MAGE-TAB format for microarray data representation and exchange has been proposed by the microarray community to replace the more complex MAGE-ML format. We present a suite of tools to support MAGE-TAB generation and validation, conversion between existing formats for data exchange, visualization of the experiment designs encoded by MAGE-TAB documents and the mining of such documents for semantic content.


Year: 2008 PMID: 19038988 PMCID: PMC2638998 DOI: 10.1093/bioinformatics/btn617

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

The data standards developed by the microarray community are among the most mature in the field of functional genomics. These standards include a convention on the disclosure of experimental data and annotation (MIAME; Brazma et al., 2001), an object model (MAGE; Spellman et al., 2002) and a data exchange format (MAGE-ML), and have been widely adopted and used successfully for several years. The MAGE-ML format is highly flexible; however, it is more complex than is typically required for data exchange between applications, and has not been widely accepted. The newly introduced MAGE-TAB format (Rayner et al., 2006), in contrast, is simpler and human-readable, and can be opened and edited in any spreadsheet application. To support MAGE-TAB data exchange, demonstrate the utility of the format in dealing with high-throughput data, and make MAGE-TAB accessible to the bioinformatics community, we have developed a series of open-source Perl applications to address common MAGE-TAB use cases.

2 SOFTWARE OVERVIEW

As shown in Figure 1, MAGETabulator consists of a suite of tools that can be used singly or together to perform a variety of tasks: (i) preparation, syntactic and semantic validation of MAGE-TAB formatted data; (ii) visualization of investigation designs encoded in MAGE-TAB; (iii) conversion of MAGE-TAB documents to MAGE-ML format; (iv) conversion of NCBI Gene Expression Omnibus (GEO) SOFT records to MAGE-TAB format; and (v) support for automated post hoc addition of ontology terms to MAGE-TAB documents.

Fig. 1. Overview of MAGETabulator components. Green arrows represent processes generating MAGE-TAB, blue arrows show potential submission routes to data repositories and red arrows depict utility functions.

2.1 Preparation, validation and visualization of MAGE-TAB formatted data

The core components of MAGETabulator can be used to (i) generate MAGE-TAB template documents for the researcher to complete; (ii) validate the syntax and check the content of the completed document; and (iii) parse the document and convert it into MAGE-ML. MAGETabulator thus facilitates the submission of MIAME-compliant data to public repositories which accept either MAGE-TAB or MAGE-ML documents.

2.1.1 Template generation and data submission

Template generation is implemented using a MySQL database to store semantic relationships between standard experiment types from the MGED Ontology (Whetzel et al., 2006) and the sample annotation and experimental variables expected for each experiment type. The user can generate a MAGE-TAB template relevant to their particular organism, technology type and the relevant biological aspects of the system under study. For example, a template produced for an experiment studying development in mice would include fields for mouse strain, sex, developmental stage, age and so on. These relationships are fully configurable via the underlying database. MAGETabulator supplies a Ruby on Rails curation web interface to simplify the template configuration process. The database and template generation web forms can be used via ArrayExpress (at http://www.ebi.ac.uk/cgi-bin/microarray/magetab.cgi), or installed and used locally. Data submissions to ArrayExpress which use MAGE-TAB can accurately describe a wider range of experiment types compared with those submitted using the MIAMExpress web interface, and are usually quicker and easier to complete for larger experiments.
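The relationship between experiment types and expected annotation can be illustrated with a minimal Python sketch. It is not the MAGETabulator implementation, which is written in Perl and stores these relationships in MySQL; the in-memory mapping, the function name sdrf_template_header and the choice of columns are assumptions made purely for illustration.

```python
# Illustrative stand-in for the database of relationships between experiment
# types and expected sample annotation. Keys and values are assumptions.
EXPECTED_CHARACTERISTICS = {
    "development_or_differentiation_design": [
        "StrainOrLine", "Sex", "DevelopmentalStage", "Age",
    ],
    "dose_response_design": ["StrainOrLine", "Sex"],
}

# Annotation assumed to be requested for every experiment type.
COMMON_CHARACTERISTICS = ["Organism"]


def sdrf_template_header(experiment_types, factors=()):
    """Return a tab-delimited header row for an empty sample table."""
    wanted = list(COMMON_CHARACTERISTICS)
    for experiment_type in experiment_types:
        wanted.extend(EXPECTED_CHARACTERISTICS.get(experiment_type, []))

    columns = ["Source Name"]
    seen = set()
    for name in wanted:
        if name not in seen:  # avoid duplicate columns across experiment types
            seen.add(name)
            columns.append(f"Characteristics[{name}]")

    columns += ["Hybridization Name", "Array Data File"]
    columns += [f"Factor Value[{factor}]" for factor in factors]
    return "\t".join(columns)


if __name__ == "__main__":
    print(sdrf_template_header(
        ["development_or_differentiation_design"], factors=["Age"]))
```

For a mouse development experiment, the header produced here contains the strain, sex, developmental stage and age columns mentioned above, plus one column for each experimental factor supplied by the user.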

2.1.2 Experiment checker

The MAGE-TAB specification describes a flexible format for its documents. MAGETabulator includes a validation script which confirms that a given MAGE-TAB document is syntactically correct, that the content is internally consistent and that it contains all the elements required by the MIAME guidelines. Any associated data files are also checked for errors. This validator script is written in modular object-oriented Perl, using a recursive-descent parser to decode the document. The code is readily extensible to support parsing of new data file types. Optionally, the script can also use Graphviz (http://www.graphviz.org/) to generate an experimental design graph to visualize the links between samples, hybridizations and data files.
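The kind of check described above, together with the generation of a design graph, can be sketched in a few lines of Python. This is a deliberately simplified illustration: it uses a plain tab-delimited reader rather than the recursive-descent parser of the actual Perl validator, and the set of required columns and the function names are assumptions. The output of to_dot is text in the Graphviz DOT language, which could then be rendered by the Graphviz tools.

```python
import csv
import io

# Assumed minimal set of columns; the real validator checks far more.
REQUIRED_COLUMNS = ("Source Name", "Hybridization Name", "Array Data File")


def check_sdrf(text):
    """Return (errors, edges) for a tab-delimited sample table.

    edges is a set of (from, to) pairs linking each sample to its
    hybridization and each hybridization to its data file.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows:
        return ["empty document"], set()

    header = rows[0]
    errors = [f"missing column: {c}" for c in REQUIRED_COLUMNS if c not in header]
    if errors:
        return errors, set()

    positions = [header.index(c) for c in REQUIRED_COLUMNS]
    edges = set()
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            errors.append(
                f"line {line_no}: expected {len(header)} cells, found {len(row)}")
            continue
        values = [row[i].strip() for i in positions]
        if not all(values):
            errors.append(f"line {line_no}: empty value in a required column")
            continue
        edges.update(zip(values, values[1:]))
    return errors, edges


def to_dot(edges):
    """Render the edges as a left-to-right Graphviz digraph."""
    lines = ["digraph design {", "  rankdir=LR;"]
    lines += [f'  "{a}" -> "{b}";' for a, b in sorted(edges)]
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = ("Source Name\tHybridization Name\tArray Data File\n"
              "mouse1\thyb1\thyb1.CEL\n")
    problems, links = check_sdrf(sample)
    print(problems)
    print(to_dot(links))
```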

2.2 Data exchange between applications

2.2.1 MAGE-TAB to MAGE-ML converter

MAGETabulator provides a tool for converting a validated MAGE-TAB document into MAGE-ML format. The output is fully compliant with current best practices for encoding experimental metadata in MAGE-ML, and conforms to the ArrayExpress standard for data submissions. While the degree of semantic information expressible in MAGE-ML is greater than that for MAGE-TAB, in practice this extra flexibility is very rarely used. ArrayExpress uses MAGE-ML internally, so that all annotation is retained from data submissions using either format.

2.2.2 SOFT to MAGE-TAB converter

To promote data exchange between the GEO and ArrayExpress databases, we have created a pipeline application which can generate MAGE-TAB directly from a GEO Series record, as part of the GEOImport package. In its simplest form, fields in the SOFT file are mapped directly to their counterparts in a new MAGE-TAB document.
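A direct field mapping of this kind can be sketched as follows. The Python code below is an illustration only, not the GEOImport pipeline itself; the particular correspondence between SOFT attributes and MAGE-TAB fields in FIELD_MAP is an assumption, as is the function name, and the Series accession in the usage example is made up.

```python
# Assumed correspondence between SOFT Series attributes and MAGE-TAB fields.
FIELD_MAP = {
    "Series_title": "Investigation Title",
    "Series_summary": "Experiment Description",
    "Series_pubmed_id": "PubMed ID",
}


def soft_series_to_idf(soft_text):
    """Map '!attribute = value' lines of a SOFT record to tab-delimited rows.

    Attributes that occur more than once are written as successive cells
    of the same row; unmapped attributes are ignored.
    """
    values = {}
    for line in soft_text.splitlines():
        if not line.startswith("!") or "=" not in line:
            continue  # skip entity lines such as '^SERIES = ...' and data rows
        key, _, value = line[1:].partition("=")
        field = FIELD_MAP.get(key.strip())
        if field:
            values.setdefault(field, []).append(value.strip())
    return "\n".join("\t".join([field, *cells]) for field, cells in values.items())


if __name__ == "__main__":
    record = (
        "^SERIES = GSE0000\n"  # made-up accession
        "!Series_title = Mouse development time course\n"
        "!Series_pubmed_id = 111\n"
    )
    print(soft_series_to_idf(record))
```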

2.3 Semantic support for MAGE-TAB documents

The MAGE-TAB format consists entirely of tab-delimited text, and imposes no intrinsic restrictions on the terms used to annotate an experiment. We have therefore adopted a post hoc validation and ontology term matching strategy to provide machine-readable semantic content in the output documents.

2.3.1 GEOImport

Annotation of sample characteristics and other experimental variables in the GEO database is heavily reliant on free text. To improve annotation consistency, MAGETabulator uses a tool that searches for candidate ontology terms within free text descriptions. The tool was implemented using the Java Finite Automata class library monq.jfa (http://www.ebi.ac.uk/Rebholz-srv/whatizit/software), and is incorporated into the GEO import pipeline.
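The idea of scanning free text for candidate ontology terms can be illustrated with a short Python sketch. It uses the standard re module in place of the monq.jfa finite automata library, so it shows the principle rather than the actual implementation; the term names, the identifiers with the made-up prefix EX and the function names are all hypothetical.

```python
import re


def build_matcher(terms):
    """Compile one case-insensitive pattern matching any known term name."""
    # Longest names first, so that 'bone marrow' is preferred over 'bone'.
    names = sorted(terms, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(name) for name in names) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


def find_candidate_terms(text, terms):
    """Return (matched text, term identifier) pairs in order of occurrence.

    terms maps term names to identifiers.
    """
    if not terms:
        return []
    lookup = {name.lower(): term_id for name, term_id in terms.items()}
    matcher = build_matcher(terms)
    return [(m.group(0), lookup[m.group(0).lower()])
            for m in matcher.finditer(text)]


if __name__ == "__main__":
    vocabulary = {"bone": "EX:0001", "bone marrow": "EX:0002", "liver": "EX:0003"}
    description = "Total RNA from Bone Marrow and liver of adult mice"
    print(find_candidate_terms(description, vocabulary))
```

Matches found this way are candidates only; exact string matching of this kind does not resolve synonyms or spelling variants.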

2.3.2 Post hoc additions to MAGE-TAB documents

The MAGE-TAB format supports encoding not only ontology term names, but also their identifiers/accession numbers and the ontologies from which they are derived. This information is optional, however, so MAGETabulator uses the ‘Double Metaphone’ phonetic algorithm (Philips, 1990) to match terms used to annotate an experiment to terms from any ontology in OBO format (http://www.geneontology.org/GO.format.obo-1_2.shtml; Smith et al., 2007). The script reads in an input MAGE-TAB document and produces a new MAGE-TAB document including the matched term identifiers and ontology source.
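Phonetic matching of annotation terms against an OBO file can be sketched as below. Note that this Python illustration substitutes the much simpler Soundex code for Double Metaphone, which is too long to reproduce here, and that it reads only the id and name tags of [Term] stanzas. The function names and the example ontology with the made-up prefix EX are assumptions, not part of MAGETabulator.

```python
# Soundex digit for each consonant; vowels, h, w and y have no digit.
_CODES = {
    letter: digit
    for digit, letters in {
        "1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r",
    }.items()
    for letter in letters
}


def soundex(word):
    """Four-character Soundex code (a simple stand-in for Double Metaphone)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    previous = _CODES.get(letters[0], "")
    for c in letters[1:]:
        code = _CODES.get(c, "")
        if code and code != previous:
            key += code
        if c not in "hw":  # h and w do not separate repeated codes
            previous = code
    return (key + "000")[:4]


def phonetic_key(phrase):
    """Phonetic key of a possibly multi-word term."""
    return " ".join(soundex(w) for w in phrase.replace("_", " ").split())


def parse_obo_terms(obo_text):
    """Return {name: id} for every [Term] stanza that has both tags."""
    terms = {}
    current = None  # tag/value pairs of the [Term] stanza being read, if any

    def flush():
        if current and "id" in current and "name" in current:
            terms[current["name"]] = current["id"]

    for raw in obo_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            flush()
            current = {} if line == "[Term]" else None
        elif current is not None and ": " in line:
            tag, _, value = line.partition(": ")
            current.setdefault(tag, value)
    flush()
    return terms


def match_terms(user_terms, obo_terms):
    """Map each user term to (ontology term name, identifier) or None."""
    index = {}
    for name, term_id in obo_terms.items():
        index.setdefault(phonetic_key(name), (name, term_id))
    return {term: index.get(phonetic_key(term)) for term in user_terms}


if __name__ == "__main__":
    ontology = "[Term]\nid: EX:0000001\nname: hybridization\n"
    print(match_terms(["Hybridisation"], parse_obo_terms(ontology)))
```

Because phonetic keys are coarse, distinct terms can share a key, so matches obtained in this way are best treated as suggestions to be reviewed.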

3 DISCUSSION

The representation of high-throughput array-based data using simple spreadsheets has allowed us to develop a suite of tools that enable a biologist with Perl skills to manage these data. The tools have been developed, tested and used extensively at the ArrayExpress database and, since February 2008, have become the major route of submission to ArrayExpress. The MAGETabulator project is publicly available on the SourceForge web site (http://tab2mage.sourceforge.net), where the code is maintained in a Subversion repository and made available for download as part of the Tab2MAGE package. Researchers who generate and use array data will benefit from the free availability of tools that bring data submission within reach of a large segment of the community.

5 in total

1. Minimum information about a microarray experiment (MIAME)-toward standards for microarray data.

Authors: A Brazma; P Hingamp; J Quackenbush; G Sherlock; P Spellman; C Stoeckert; J Aach; W Ansorge; C A Ball; H C Causton; T Gaasterland; P Glenisson; F C Holstege; I F Kim; V Markowitz; J C Matese; H Parkinson; A Robinson; U Sarkans; S Schulze-Kremer; J Stewart; R Taylor; J Vilo; M Vingron
Journal: Nat Genet Date: 2001-12 Impact factor: 38.330

2. The MGED Ontology: a resource for semantics-based description of microarray experiments.

Authors: Patricia L Whetzel; Helen Parkinson; Helen C Causton; Liju Fan; Jennifer Fostel; Gilberto Fragoso; Laurence Game; Mervi Heiskanen; Norman Morrison; Philippe Rocca-Serra; Susanna-Assunta Sansone; Chris Taylor; Joseph White; Christian J Stoeckert
Journal: Bioinformatics Date: 2006-01-21 Impact factor: 6.937

3. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration.

Authors: Barry Smith; Michael Ashburner; Cornelius Rosse; Jonathan Bard; William Bug; Werner Ceusters; Louis J Goldberg; Karen Eilbeck; Amelia Ireland; Christopher J Mungall; Neocles Leontis; Philippe Rocca-Serra; Alan Ruttenberg; Susanna-Assunta Sansone; Richard H Scheuermann; Nigam Shah; Patricia L Whetzel; Suzanna Lewis
Journal: Nat Biotechnol Date: 2007-11 Impact factor: 54.908

4. A simple spreadsheet-based, MIAME-supportive format for microarray data: MAGE-TAB.

Authors: Tim F Rayner; Philippe Rocca-Serra; Paul T Spellman; Helen C Causton; Anna Farne; Ele Holloway; Rafael A Irizarry; Junmin Liu; Donald S Maier; Michael Miller; Kjell Petersen; John Quackenbush; Gavin Sherlock; Christian J Stoeckert; Joseph White; Patricia L Whetzel; Farrell Wymore; Helen Parkinson; Ugis Sarkans; Catherine A Ball; Alvis Brazma
Journal: BMC Bioinformatics Date: 2006-11-06 Impact factor: 3.169

5. Design and implementation of microarray gene expression markup language (MAGE-ML).

Authors: Paul T Spellman; Michael Miller; Jason Stewart; Charles Troup; Ugis Sarkans; Steve Chervitz; Derek Bernhart; Gavin Sherlock; Catherine Ball; Marc Lepage; Marcin Swiatek; W L Marks; Jason Goncalves; Scott Markel; Daniel Iordan; Mohammadreza Shojatalab; Angel Pizarro; Joe White; Robert Hubley; Eric Deutsch; Martin Senger; Bruce J Aronow; Alan Robinson; Doug Bassett; Christian J Stoeckert; Alvis Brazma
Journal: Genome Biol Date: 2002-08-23 Impact factor: 13.583

