Importing ArrayExpress datasets into R/Bioconductor.

Audrey Kauffmann, Tim F Rayner, Helen Parkinson, Misha Kapushesky, Margus Lukk, Alvis Brazma, Wolfgang Huber.

Abstract

SUMMARY: ArrayExpress is one of the largest public repositories of microarray datasets. R/Bioconductor provides a comprehensive suite of microarray analysis and integrative bioinformatics software. However, easy ways of importing datasets from ArrayExpress into R/Bioconductor have been lacking. Here, we present such a tool that is suitable for both interactive and automated use.

AVAILABILITY: The ArrayExpress package is available from the Bioconductor project at http://www.bioconductor.org. A user's guide and examples are provided with the package.

Year: 2009 PMID: 19505942 PMCID: PMC2723004 DOI: 10.1093/bioinformatics/btp354

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

ArrayExpress is a public database for high-throughput functional genomics data (Parkinson et al., 2009). It consists of a repository, which is a MIAME-supportive (Brazma et al., 2001) public archive of microarray data, and an added-value gene expression Atlas created from the repository data. Currently, nearly 8000 experiments comprising 230 000 arrays are available from ArrayExpress. Retrieving publicly available data for analysis is a repetitive and error-prone task for which automation is desirable. As Bioconductor (Gentleman et al., 2004) contains many widely used tools for data analysis, tools that connect it with public databases are useful. The GEOquery package (Davis and Meltzer, 2007) was developed to load GEO datasets into Bioconductor, and the RMAGEML package (Durinck et al., 2004) was designed to import the MAGE-ML files that in the past were used by ArrayExpress for data transfer. The ArrayExpress database now supports the MAGE-TAB format (Rayner et al., 2006), a metadata-rich, but much simpler and more resource-efficient format based on tab-delimited files, and all data are made available in this format. We have developed the ArrayExpress package for R/Bioconductor to query ArrayExpress and convert MAGE-TAB formatted datasets from the ArrayExpress repository into objects of the Bioconductor class for microarray datasets, eSet.

2 MIAME

MIAME is a guideline that describes the Minimum Information About a Microarray Experiment needed to ensure interpretation of a microarray dataset. It has five elements: (i) the raw data for each hybridization, (ii) the final processed data for the set of hybridizations in the experiment, (iii) the experiment design including sample data relationships and the essential sample annotation including experimental factors and their values, (iv) sufficient annotation of the array design and (v) essential laboratory and data processing protocols.

3 MAGE-TAB

MAGE-TAB is a tabular, MIAME-supportive file format. A MAGE-TAB document consists of five different types of files. (i) A ‘raw’ zip archive contains the raw data files, i.e. the files produced by the microarray image analysis software, such as CEL files for Affymetrix GeneChips or GPR files from GenePix. (ii) A ‘data matrix’ file contains processed values, as provided by the data submitter, converted into a common tab-delimited text format representing a matrix of numbers. (iii) The Sample and Data Relationship Format (SDRF) tab-delimited file contains the relationships between samples and arrays, as well as sample properties and experimental factors, as provided by the data submitter. (iv) The Array Design Format (ADF) tab-delimited file describes the design of an array, i.e. the sequence located at each feature on the array and the annotation of the sequences. (v) The Investigation Description Format (IDF) tab-delimited file contains top-level information about the experiment including title, description, submitter contact details and protocols.
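Because the SDRF is a plain tab-delimited table, its role as the link between samples, arrays and experimental factors can be illustrated with base R alone. The following is a minimal sketch, not the package's own parser: the toy rows are invented, and the column headers (Source Name, Array Data File, Factor Value [genotype]) are assumed here as typical MAGE-TAB headers rather than taken from a real submission.

```r
# Toy SDRF content; the rows are invented for illustration only and the
# column headers are assumed to follow common MAGE-TAB usage.
sdrf_text <- paste(
  "Source Name\tArray Data File\tFactor Value [genotype]",
  "sample1\tarray1.txt\twild type",
  "sample2\tarray2.txt\tmutant",
  sep = "\n"
)

# check.names = FALSE keeps headers such as "Factor Value [genotype]" intact.
sdrf <- read.delim(text = sdrf_text, check.names = FALSE,
                   stringsAsFactors = FALSE)

# Index the table by data file, so that each row describes one array.
rownames(sdrf) <- sdrf[["Array Data File"]]

# Experimental factors are the columns whose header starts with "Factor Value".
factor_cols <- grep("^Factor Value", colnames(sdrf), value = TRUE)
print(sdrf[, factor_cols, drop = FALSE])
```

Indexing the rows by data file name reflects the fact that downstream objects need one annotation row per array.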

4 BIOCONDUCTOR CLASSES

The Bioconductor class eSet is a different implementation of the MIAME standard. The class has various specializations, or subclasses, that are adapted to specific array technologies; among these are ExpressionSet for generic one-colour datasets, NChannelSet for generic two-colour datasets and AffyBatch for data from Affymetrix GeneChips. Objects of this class contain one or more identical-sized numeric matrices as assayData elements. They also include a table describing the sample–array relationship as phenoData and a table describing the array features as featureData. Details of experimental methods are stored in the experimentData component.
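How these components line up can be pictured with a plain R list. The sketch below is only a stand-in and is not the Biobase implementation of eSet: the object toy_eset, the helper is_consistent and all values are hypothetical. It illustrates the constraint that the columns of each assayData matrix correspond to the rows of phenoData, and the rows of each matrix to the rows of featureData.

```r
# Base-R stand-in for the components named in the text. This is NOT the
# Biobase class, only a picture of how the pieces relate to each other.
exprs_mat <- matrix(c(7.1, 8.3, 6.9, 5.2, 5.0, 5.4), nrow = 2, byrow = TRUE,
                    dimnames = list(c("feat1", "feat2"),
                                    c("arrayA", "arrayB", "arrayC")))

toy_eset <- list(
  assayData      = list(exprs = exprs_mat),
  phenoData      = data.frame(genotype = c("wt", "wt", "mut"),
                              row.names = colnames(exprs_mat)),
  featureData    = data.frame(symbol = c("geneX", "geneY"),
                              row.names = rownames(exprs_mat)),
  experimentData = list(title = "toy experiment")
)

# Check that samples and features agree across all components.
is_consistent <- function(x) {
  dims <- lapply(x$assayData, dimnames)
  same_dims <- all(vapply(dims, identical, logical(1), dims[[1]]))
  same_dims &&
    identical(colnames(x$assayData[[1]]), rownames(x$phenoData)) &&
    identical(rownames(x$assayData[[1]]), rownames(x$featureData))
}

print(is_consistent(toy_eset))
```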

5 RETRIEVING AND CONVERTING MAGE-TAB DATA

The ArrayExpress package uses the zip archive with either the raw or the processed data to build the assayData component. The SDRF file is used to construct the phenoData table. The ADF file is used to construct the featureData, and the IDF file to fill in the experimentData components.

5.1 Raw data

To import a raw dataset from ArrayExpress, one loads the package and calls the ArrayExpress function with the accession identifier of the experiment, for example E-ATMX-18. As E-ATMX-18 is a two-colour experiment, the returned R object is of class NChannelSet. If the identifier refers to an Affymetrix experiment, the output is an AffyBatch; if it refers to a one-colour experiment using a platform other than Affymetrix, the output is an ExpressionSet.

The ArrayExpress function extracts feature intensity summaries from columns of the raw data files based on the common conventions for the data file sources. If the data source is not recognized, or the file does not have the expected column names, the user is asked to explicitly provide the name of the column(s) to extract, for instance, ‘Cy3 Median’. In some cases, there is a mismatch between the sample or feature annotations and the intensity data files; in such cases, a warning is emitted, the phenoData and/or featureData components are left empty and an incomplete (but syntactically valid) object is returned.

Tested on the 5298 accessions with raw datasets that were available from the ArrayExpress repository in March 2009, the ArrayExpress function managed to create a complete object in 58% of the cases (Table 1). The 42% of cases in which the function failed or an incomplete object was produced are due to a variety of reasons, including missing or contradictory data in the repository. We are actively working on manually curating these cases and resolving problems as much as possible; however, due to the repository's role as a public record of scientific activity, problems inherent to information submitted by the contributors may persist.
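The import call described at the start of this section might look as follows. This is a hedged sketch rather than the code from the original article: it assumes that the Bioconductor package ArrayExpress is installed and that the accession is still served by the repository, so the download is only attempted in an interactive session. The helper expected_class and its platform labels are hypothetical and merely restate the class rule given in the text.

```r
# Class expected from the platform type, as stated in the text.
# The platform labels used as keys are hypothetical.
expected_class <- function(platform) {
  switch(platform,
         "two-colour" = "NChannelSet",
         "affymetrix" = "AffyBatch",
         "one-colour" = "ExpressionSet",
         stop("unknown platform type: ", platform))
}

accession <- "E-ATMX-18"  # two-colour experiment cited in the text

# Requires the ArrayExpress package and network access, so the download is
# guarded and skipped in non-interactive runs.
if (interactive() && requireNamespace("ArrayExpress", quietly = TRUE)) {
  ATMX18 <- ArrayExpress::ArrayExpress(accession)
  print(class(ATMX18))  # the text reports an NChannelSet for this accession
}

print(expected_class("two-colour"))
```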

Table 1. Application of the ArrayExpress package to the ArrayExpress database in March 2009

Category	Count	Percentage of datasets
Number of accessions	6117
Number of datasets	6891
Objects created fully automatically	5550	81%
  Complete objects created	4017	58%
    Affymetrix	3407
    Two-colour	89
    One-colour	521
  Incomplete objects	1533	22%
    Missing feature annotation	1121
    Missing sample annotation	466
Objects created with manual selection of columns	619	9%
Object creation failed	722	10%

The number of datasets is higher than the number of accessions since some accessions store multiple datasets (we consider measurements made with different arrays as different datasets). Manual setting of column names was necessary for 1082 (16%) of the 6891 datasets, and we were successful in 619 (9%) cases.
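The counts in Table 1 can be cross-checked with a few lines of arithmetic. The sketch below only recomputes figures already given in the table; the variable names are illustrative. The final step assumes that every incomplete object lacks at least one of the two annotation types, in which case the overlap between the two categories follows from the totals.

```r
n_datasets <- 6891
counts <- c(automatic = 5550, complete = 4017, incomplete = 1533,
            manual = 619, failed = 722)

# Percentages relative to all datasets, rounded as in Table 1.
pct <- round(100 * counts / n_datasets)
print(pct)

# The three top-level outcomes partition the datasets.
stopifnot(sum(counts[c("automatic", "manual", "failed")]) == n_datasets)
# Automatically created objects are either complete or incomplete.
stopifnot(counts[["complete"]] + counts[["incomplete"]] == counts[["automatic"]])
# Complete objects split by platform.
stopifnot(3407 + 89 + 521 == counts[["complete"]])

# The two annotation problems sum to more than the incomplete total, which
# suggests that some datasets lack both kinds of annotation.
both_missing <- 1121 + 466 - counts[["incomplete"]]
print(both_missing)  # 54
```

The rounded percentages reproduce the 81%, 58%, 22%, 9% and 10% shown in the table.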

In addition to calling the one-stop function ArrayExpress, it is possible to download the data for local storage using the function getAE and to import a locally stored MAGE-TAB document with the function magetab2bioc.

5.2 Processed data

The way processed data are handled in the database is less uniform than for raw data, because processing methods vary more than the microarray image analysis software outputs. To import a processed dataset from ArrayExpress, three steps are required: download the dataset, identify which column is of interest, and create the R object. In our example, the names of all columns in the processed data are collected in a character vector cn, and after visual inspection, we decided to use the second one.
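The three steps can be mimicked in base R on a small in-memory table. This is a sketch of the pattern only and does not use the package's functions: the tab-delimited content, the sample.quantity naming of the columns and the variable names other than cn are assumptions made for illustration.

```r
# Step 1 stand-in: a tiny tab-delimited data matrix held in memory instead of
# a downloaded file. Column names and values are invented for illustration.
processed_text <- paste(
  "ID\tsample1.detection\tsample1.signal\tsample2.detection\tsample2.signal",
  "probe1\t0.01\t7.2\t0.03\t6.8",
  "probe2\t0.40\t5.1\t0.35\t5.3",
  sep = "\n"
)
processed <- read.delim(text = processed_text, row.names = 1,
                        check.names = FALSE)

# Step 2: list the available quantities and pick one after inspection.
cn <- unique(sub("^.*\\.", "", colnames(processed)))
print(cn)
chosen <- cn[2]  # the second entry, as in the text

# Step 3: keep only the chosen quantity as a numeric matrix, one column
# per sample.
suffix <- paste0("\\.", chosen, "$")
keep <- grepl(suffix, colnames(processed))
mat <- as.matrix(processed[, keep, drop = FALSE])
colnames(mat) <- sub(suffix, "", colnames(mat))
print(mat)
```

Inspecting cn before choosing matters because, as noted above, submitters process and label their data in different ways.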

6 APPLICATION

We used the queryAE function to list all datasets concerned with breast cancer in Homo sapiens. Then, using the ArrayExpress function, we created R objects from all datasets for which raw data were available. We counted, for each dataset, the number of arrays and features. The Supplementary table summarizes the results of this analysis. This could now be followed by an integrative analysis of the data, a complex and open-ended task for which essential tools are provided in the Bioconductor project: the quality of the datasets could be assessed with the help of the arrayQualityMetrics package (Kauffmann et al., 2009), they could be normalized and analysed for differential expression of genes and gene sets (Hahne et al., 2008), and the combination of different datasets is facilitated, for example, by the MergeMaid package (Cope et al., 2004).

7 CONCLUSIONS

The ArrayExpress package is freely available, open source and easy to use. As most of the Bioconductor tools for microarray analysis process eSet objects, the package facilitates large-scale analyses of public data. A strength of the package is the richness, accuracy and standardized format of the metadata that it imports together with the array intensity data. In fact, the diagnostics produced by the package during dataset import from the ArrayExpress repository are currently used by the curators to decrease the number of problematic experiments and improve the quality of the content delivered. For the end user, the ArrayExpress package eliminates, or at least greatly reduces, the amount of manual intervention needed and helps towards automated processing of large collections of datasets.

REFERENCES

1. Minimum information about a microarray experiment (MIAME)-toward standards for microarray data.

Authors: A Brazma; P Hingamp; J Quackenbush; G Sherlock; P Spellman; C Stoeckert; J Aach; W Ansorge; C A Ball; H C Causton; T Gaasterland; P Glenisson; F C Holstege; I F Kim; V Markowitz; J C Matese; H Parkinson; A Robinson; U Sarkans; S Schulze-Kremer; J Stewart; R Taylor; J Vilo; M Vingron
Journal: Nat Genet Date: 2001-12 Impact factor: 38.330

2. Importing MAGE-ML format microarray data into BioConductor.

Authors: Steffen Durinck; Joke Allemeersch; Vincent J Carey; Yves Moreau; Bart De Moor
Journal: Bioinformatics Date: 2004-07-15 Impact factor: 6.937

3. MergeMaid: R tools for merging and cross-study validation of gene expression data.

Authors: Leslie Cope; Xiaogang Zhong; Elizabeth Garrett; Giovanni Parmigiani
Journal: Stat Appl Genet Mol Biol Date: 2004-10-31

4. GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor.

Authors: Sean Davis; Paul S Meltzer
Journal: Bioinformatics Date: 2007-05-12 Impact factor: 6.937

5. Bioconductor: open software development for computational biology and bioinformatics.

Authors: Robert C Gentleman; Vincent J Carey; Douglas M Bates; Ben Bolstad; Marcel Dettling; Sandrine Dudoit; Byron Ellis; Laurent Gautier; Yongchao Ge; Jeff Gentry; Kurt Hornik; Torsten Hothorn; Wolfgang Huber; Stefano Iacus; Rafael Irizarry; Friedrich Leisch; Cheng Li; Martin Maechler; Anthony J Rossini; Gunther Sawitzki; Colin Smith; Gordon Smyth; Luke Tierney; Jean Y H Yang; Jianhua Zhang
Journal: Genome Biol Date: 2004-09-15 Impact factor: 13.583

6. arrayQualityMetrics--a bioconductor package for quality assessment of microarray data.

Authors: Audrey Kauffmann; Robert Gentleman; Wolfgang Huber
Journal: Bioinformatics Date: 2008-12-23 Impact factor: 6.937

7. A simple spreadsheet-based, MIAME-supportive format for microarray data: MAGE-TAB.

Authors: Tim F Rayner; Philippe Rocca-Serra; Paul T Spellman; Helen C Causton; Anna Farne; Ele Holloway; Rafael A Irizarry; Junmin Liu; Donald S Maier; Michael Miller; Kjell Petersen; John Quackenbush; Gavin Sherlock; Christian J Stoeckert; Joseph White; Patricia L Whetzel; Farrell Wymore; Helen Parkinson; Ugis Sarkans; Catherine A Ball; Alvis Brazma
Journal: BMC Bioinformatics Date: 2006-11-06 Impact factor: 3.169

8. ArrayExpress update--from an archive of functional genomics experiments to the atlas of gene expression.

Authors: Helen Parkinson; Misha Kapushesky; Nikolay Kolesnikov; Gabriella Rustici; Mohammad Shojatalab; Niran Abeygunawardena; Hugo Berube; Miroslaw Dylag; Ibrahim Emam; Anna Farne; Ele Holloway; Margus Lukk; James Malone; Roby Mani; Ekaterina Pilicheva; Tim F Rayner; Faisal Rezwan; Anjan Sharma; Eleanor Williams; Xiangqun Zheng Bradley; Tomasz Adamusiak; Marco Brandizi; Tony Burdett; Richard Coulson; Maria Krestyaninova; Pavel Kurnosov; Eamonn Maguire; Sudeshna Guha Neogi; Philippe Rocca-Serra; Susanna-Assunta Sansone; Nataliya Sklyar; Mengyao Zhao; Ugis Sarkans; Alvis Brazma
Journal: Nucleic Acids Res Date: 2008-11-10 Impact factor: 16.971
