
Many Microbe Microarrays Database: uniformly normalized Affymetrix compendia with structured experimental metadata.

Jeremiah J Faith, Michael E Driscoll, Vincent A Fusaro, Elissa J Cosgrove, Boris Hayete, Frank S Juhn, Stephen J Schneider, Timothy S Gardner.

Abstract

Many Microbe Microarrays Database (M3D) is designed to facilitate the analysis and visualization of expression data in compendia compiled from multiple laboratories. M3D contains over a thousand Affymetrix microarrays for Escherichia coli, Saccharomyces cerevisiae and Shewanella oneidensis. The expression data is uniformly normalized to make the data generated by different laboratories and researchers more comparable. To facilitate computational analyses, M3D provides raw data (CEL file) and normalized data downloads of each compendium. In addition, web-based construction, visualization and download of custom datasets are provided to facilitate efficient interrogation of the compendium for more focused analyses. The experimental condition metadata in M3D is human curated with each chemical and growth attribute stored as a structured and computable set of experimental features with consistent naming conventions and units. All versions of the normalized compendia constructed for each species are maintained and accessible in perpetuity to facilitate the future interpretation and comparison of results published on M3D data. M3D is accessible at http://m3d.bu.edu/.

Year: 2007 PMID: 17932051 PMCID: PMC2238822 DOI: 10.1093/nar/gkm815

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Microarrays, once a selectively used expensive tool, have become increasingly common due to their falling costs and increased credibility over the past 10 years. In contrast to the bulk of DNA sequencing, which has been taken over by large centers that automatically submit sequencing reads to centralized databases (e.g. GenBank), the majority of microarray expression data is still generated by smaller laboratories addressing particular biological questions. Given the diversity of expression possibilities in the cell and the stochastic nature of transcription and of microarrays themselves, previous studies have found computational analysis of large sets of microarrays (compendia) to be a powerful means of identifying strong biological signals between genes and across conditions (1–4). Historically these compendia have been generated as large internally controlled projects from a single laboratory, often excluding smaller datasets from independent laboratories. Yet, the many small microarray datasets generated worldwide represent a large and underutilized resource for genome-scale analyses such as compound mode of action identification (5) and network inference (6,7).

The creators of the GEO database at NCBI (8) and the ArrayExpress database at EBI (9) have sought to address this opportunity by providing a central repository of expression data for large and small laboratories. While valuable first initiatives, the GEO and ArrayExpress databases are not yet structured in a way that facilitates efficient exploration or analysis of the data. Four main obstacles exist.

First, submitting microarray datasets to repositories is more difficult than submitting sequences to GenBank: the data itself is more complicated, requiring submission formats that are beyond the means of many non-computational researchers. Thus a significant number of published array datasets have not been deposited. This problem has been addressed to some degree as more journals require submission of microarray data to GEO or ArrayExpress.

The second obstacle is the presence of platform-specific biases in expression data due to the use of many different microarray platforms in a compendium. These biases obfuscate the interpretation of the integrated dataset. For dual-channel arrays, the situation is often further complicated by the lack of a single physiological reference condition used across all arrays in the platform. This lack of uniform reference prohibits some types of computational analyses. Hence, the first step in the analysis of an array compendium is often to segregate data into sets with a uniform reference condition and a consistent array type. This time-consuming step often reduces the compendium to a far less expansive dataset.

The third obstacle is the lack of uniformity in the format of expression data, even within a single expression platform. Various software algorithms are available for preprocessing and normalizing the raw microarray intensity values. The data deposited in GEO and ArrayExpress does not necessarily employ a uniform preprocessing approach, nor is the raw intensity data always provided with the deposits. Thus, end-user performed preprocessing and normalization is precluded.

The fourth obstacle is the incompleteness and inconsistency in the curation of metadata describing the details of each experimental condition. Each expression profile run for a given species can have a different genetic background, media, growth conditions and any number of chemicals, which might have an effect on the cell's expression. Such data is fundamental to the meaningful interpretation of expression data. Even when provided, this metadata is found as unstructured prose in the database deposit or in the methods sections of each publication. Ideally, this metadata would be collected in a computable format with uniform units across all laboratories. Although standards like MIAME (10) promote the human interpretation of experimental conditions, the standard is unevenly applied and it does not facilitate computational analysis.

To address the latter three of these problems, we have constructed the Many Microbe Microarrays Database (M3D). M3D currently contains over 1000 microarrays for Escherichia coli (507), Saccharomyces cerevisiae (530) and Shewanella oneidensis (14), all of which were collected and combined from individual investigators, GEO (8), ArrayExpress (9) and ASAP (11). To avoid problems with platform-specific biases, M3D contains only single-channel Affymetrix microarrays. The expression data is uniformly normalized to enable web-based or offline (via a database dump) analysis without further user-dependent normalization. This facilitates analysis of the data across all laboratories and conditions, even by non-expert users. A set of web-based browsing and analysis tools is provided to facilitate efficient interrogation of the dataset without extensive computational skills. Raw intensity data files are also provided for all datasets for expert users.

Importantly, experimental metadata in M3D is human curated from each microarray publication, converting each chemical and growth attribute into a structured and computable set of experimental features with consistent naming conventions and units. Finally, all versions of the database builds are maintained and accessible in perpetuity on the website to facilitate the future interpretation and comparison of results published on M3D data. The various attributes of M3D (comprehensive data and metadata, uniform normalization, access to raw data dumps, a computable structure, versioning of the database and web-based analysis tools) facilitate both efficient human interrogation of the dataset and machine-based computational analysis. Moreover, the consistency and uniformity of the dataset facilitate downstream comparison of results and findings based on the dataset.

SINGLE-PLATFORM, SINGLE-CHANNEL, UNIFORMLY NORMALIZED

Large microarray depositories like GEO and ArrayExpress focus on the archiving of expression data as used in specific publications. These archives play an essential role in biological science by allowing transparent replication of microarray analyses by other researchers. Experimenters using the same array platform often use different normalization methods for their analyses, so that data downloaded from different projects on GEO or ArrayExpress are unlikely to be directly comparable. GEO at NCBI provides GEO DataSets to alleviate this problem. A GEO DataSet contains a collection of biologically and statistically comparable microarray samples processed using the same platform. Unfortunately, there is a significant delay between when a sample is submitted to GEO and when it is available as a GEO DataSet. As a result, GEO DataSets offered only one-fifth as many samples as M3D contains (Figure 1A and B).

Figure 1.

All of the available E. coli Affymetrix Antisense2 expression data for the transcription factor lexA and its known target recA were downloaded from NCBI GEO Profiles (A) and from M3D compendium E_coli_v3_Build_1 (B and C). NCBI GEO Profile data is derived from NCBI GEO DataSets that contain only a subset of the data in GEO; therefore, many more samples were available for plotting from M3D (445) than from GEO (85). The correlation between lexA and its known target was higher when the raw data was uniformly normalized with RMA (C) rather than normalizing each microarray individually with MAS5 (A and B).

We have initially chosen to include only single-channel Affymetrix microarrays in M3D. The photolithography process used by Affymetrix allows all laboratories to start with a very consistent substrate for hybridization. In addition, the single-channel design eliminates the need for a common reference condition for all arrays. Thus, in contrast to two-color array designs, data from different laboratories and projects can be integrated without artifacts due to an inconsistent reference condition. The remaining systematic biases in the Affymetrix platform are due to researcher-specific differences in the RNA preparation and hybridization protocols. However, when the raw probe-level microarray data (CEL files) are normalized as a group with RMA (12), we find that these systematic researcher biases are small relative to the biological changes that occur across experimental conditions (7). In addition, the RMA normalized data tends to have higher correlation between the expression of transcription factors and their known targets (Figure 1B and C).

To employ the RMA normalization approach in M3D, all expression profiles for a particular array design (e.g. the E. coli Antisense 2 array) are collected, uniformly normalized and deposited as a ‘build’. Periodically, we add new expression profiles for a particular array design, renormalize all data, and release a new ‘build’. This ensures that all experiments in any build are uniformly normalized and comparable across conditions. The renormalization process may result in small changes in the expression values of all profiles. Thus, all builds are labeled with a version number that references the underlying MySQL schema of the database and a build number that denotes the particular set of microarray data (e.g. E_coli_v3_Build_2 uses MySQL schema version 3 and is the second compendium built for E. coli). Builds are maintained in perpetuity. This system, like the build system used by the human genome assembly, allows computational researchers to specify the exact dataset used for a particular analysis.
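As an illustration of the build-labeling convention, the following minimal Python sketch splits a build label into its species, schema version and build number. The pattern is inferred only from the example label E_coli_v3_Build_2 given above; the names `BuildId` and `parse_build_name` are hypothetical and are not part of M3D.

```python
import re
from typing import NamedTuple


class BuildId(NamedTuple):
    species: str         # e.g. "E_coli"
    schema_version: int  # version of the underlying database schema
    build_number: int    # which compendium build this is for the species


# Assumption: every label follows the shape of "E_coli_v3_Build_2".
_BUILD_RE = re.compile(r"^(?P<species>.+)_v(?P<schema>\d+)_Build_(?P<build>\d+)$")


def parse_build_name(name: str) -> BuildId:
    """Split a build label into species, schema version and build number."""
    match = _BUILD_RE.match(name)
    if match is None:
        raise ValueError(f"not a recognized build name: {name!r}")
    return BuildId(match["species"], int(match["schema"]), int(match["build"]))


print(parse_build_name("E_coli_v3_Build_2"))
```

Recording the full label alongside an analysis makes it possible to state exactly which compendium the results were computed on.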

CURATED, COMPUTABLE EXPERIMENTAL METADATA

The experimental condition information underlying each microarray sample is the most under-utilized aspect of compendia collected from multiple disparate sources. In scientific language there are typically multiple units that can be used to describe a particular aspect of an experiment. For example, the amount of glucose added to a medium can be described in weight/volume, as a percent solution, or using molarity. To promote large-scale analyses of the relationship between experimental conditions and the expression values of each gene, we provide the quantitative and qualitative features of each experimental condition cataloged in a consistent framework suitable for computation. We use human curation to convert the condition metadata in each publication into consistent units and naming conventions, and we use computer validation to ensure data integrity.
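The glucose example can be made concrete with a small Python sketch that converts the three ways of reporting a concentration into one unit. The choice of millimolar as the common unit, the unit strings and the function name `to_millimolar` are assumptions made for illustration; the units M3D actually stores are not specified here.

```python
GLUCOSE_G_PER_MOL = 180.156  # molar mass of glucose (C6H12O6)


def to_millimolar(value: float, unit: str, molar_mass: float) -> float:
    """Convert a concentration to mM. Supported units: 'mM', 'M', 'g/L', '% w/v'."""
    if unit == "mM":
        return value
    if unit == "M":
        return value * 1000.0
    if unit == "g/L":
        return value / molar_mass * 1000.0
    if unit == "% w/v":
        # 1% w/v means 1 g per 100 mL, which is 10 g/L.
        return value * 10.0 / molar_mass * 1000.0
    raise ValueError(f"unsupported unit: {unit!r}")


# Three hypothetical publications reporting glucose in different units.
reported = [(0.4, "% w/v"), (4.0, "g/L"), (22.2, "mM")]
for value, unit in reported:
    print(f"{value} {unit} -> {to_millimolar(value, unit, GLUCOSE_G_PER_MOL):.1f} mM")
```

All three reports describe roughly the same glucose concentration, which only becomes visible to a program once the values share a unit.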

BULK DOWNLOADS

To facilitate large-scale computational analyses of compendium data, we provide bulk downloads of the normalized expression data in M3D. For each build, we provide separate files containing normalized data for all genes, all genes + intergenic regions, and all genes + intergenic regions + control probes. We also provide flat files containing the gene names, probe set names and curated experimental condition information. In addition, we provide the raw CEL files as a tar archive for researchers interested in using or developing other normalization methods.
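For offline work with a downloaded file, a loader along the following lines may be a reasonable starting point. The layout is an assumption, not a documented format: a tab-delimited table with one header row of experiment names and one row per gene or probe set. The gene names and values in the sample are made up for the demonstration.

```python
import csv
import io


def read_expression_table(handle):
    """Return (experiment_names, {gene: [values]}) from a tab-delimited table.

    Assumed layout: header row of experiment names, first column = gene name.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    experiments = header[1:]
    values = {}
    for row in reader:
        if not row:  # skip blank lines
            continue
        values[row[0]] = [float(x) for x in row[1:]]
    return experiments, values


# Tiny made-up table standing in for a downloaded file.
sample = (
    "gene\texp_1\texp_2\texp_3\n"
    "lexA\t9.1\t9.8\t11.2\n"
    "recA\t10.0\t10.9\t12.5\n"
)
experiments, values = read_expression_table(io.StringIO(sample))
print(experiments)
print(values["recA"])
```

For a real file, `io.StringIO(sample)` would be replaced by an opened file handle, and the parsing adjusted to whatever layout the downloaded file actually uses.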

ONLINE ANALYSIS, VISUALIZATION AND CUSTOM DATA DOWNLOADS

For more targeted analysis and data exploration, M3D allows the flexible construction, visualization and download of custom datasets. Users can select any subset of the experiments in M3D using checkboxes or by selecting ‘projects’ that represent larger groups of experiments (typically a project is the set of microarrays available in a single publication). Similarly, users can choose a subset of genes by typing or uploading a list of gene and/or probe names. Genes can also be selected by differential expression as measured by t-test, z-test or fold change (e.g. choose all genes with a significant expression change between experiments A,C,E versus B,D,F,G as measured by a t-test with a user-chosen significance threshold). Once a user selects a set of genes and experiments, the data can be downloaded or visualized. Although there are many existing general plotting tools and a few software visualization products dedicated to microarrays, it is often convenient to be able to choose a few conditions of interest, type in a few genes and see a quick plot of the data. M3D currently provides heat plots (with and without clustering), expression histograms (for individual genes and groups of genes), scatter plots and a genome browser for visualization of expression in a genome context (Figure 2) (13).
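The differential-expression selection described above can be approximated offline with a short Python sketch. This is not M3D's implementation: it uses Welch's t statistic together with a difference of group means, and it thresholds the statistic directly rather than a p-value (a p-value would need the t distribution, which the standard library does not provide). The gene names, expression values and thresholds are made up, and values are assumed to be on a log2 scale.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    diff = mean(a) - mean(b)
    if se == 0:  # both groups constant: statistic is undefined, use a limit
        return 0.0 if diff == 0 else float("inf") * (1 if diff > 0 else -1)
    return diff / se


def select_genes(expr, group_a, group_b, min_abs_t=4.0, min_abs_diff=1.0):
    """Return genes whose group means differ by both criteria.

    expr maps gene -> {experiment: log2 expression}; thresholds are illustrative.
    """
    selected = []
    for gene, profile in expr.items():
        a = [profile[e] for e in group_a]
        b = [profile[e] for e in group_b]
        diff = mean(a) - mean(b)  # log2 fold change for log2-scale data
        if abs(diff) >= min_abs_diff and abs(welch_t(a, b)) >= min_abs_t:
            selected.append(gene)
    return selected


# Made-up data for experiments A,C,E versus B,D,F,G.
expr = {
    "geneX": {"A": 10.1, "C": 10.3, "E": 9.9, "B": 7.0, "D": 7.2, "F": 6.9, "G": 7.1},
    "geneY": {"A": 8.0, "C": 8.2, "E": 7.9, "B": 8.1, "D": 8.0, "F": 7.8, "G": 8.2},
}
print(select_genes(expr, ["A", "C", "E"], ["B", "D", "F", "G"]))
```

Combining a fold-change cutoff with a test statistic is a common way to avoid selecting genes whose change is statistically detectable but very small.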

Figure 2.

Custom datasets constructed on M3D can be visualized with scatterplots (A), histograms of individual genes (B), heatplots (C), histograms of collections of genes (D) and in their genome context using a genome browser (E).

All browsing, analytical and download features of M3D are accessed from the same page, the Analysis page, on the website. This page guides the user step-by-step through the process of database selection, experiment selection, gene selection, visualization, analysis and download. At each step, a user's selections are saved in a cookie, enabling the user to return to and modify any prior selection without losing any other selections. The user can also select ‘Start Over’ from any point in the analysis to clear all selections. Context-specific help is provided on each page by mousing over the ‘[?]’ symbol.

REMOTELY ACCESSIBLE VISUALIZATION

The power of the Internet resides in its interconnectivity. Biological databases like NCBI (14), EcoCyc (15) and RegulonDB (16) provide easy linking mechanisms so that other databases can automatically generate hyperlinks to their content. M3D provides a simple mechanism for generating links to M3D GenePages, which contain a basic set of plots for each gene. In addition, all of the plots on M3D are generated using a simple URL syntax so that websites can easily include plots generated by M3D on their own sites. For example, because strong correlation is often found between the expression of a transcription factor and its targets, a website cataloging known or predicted regulatory interactions might find it useful to provide a scatter plot of the transcription factor's expression versus its target gene's expression, allowing users to see whether expression data currently supports the regulatory interaction. The URL syntax for this and other plots can be found by clicking the help tab on the main menu of M3D.
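The quantity that such a scatter plot conveys visually, the correlation between a transcription factor and its target, can also be computed directly from downloaded data. The sketch below is a plain Pearson correlation in Python; the expression values are made up for illustration and are not taken from M3D, and the function name `pearson` is hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


# Made-up expression values for a transcription factor and one target
# across the same five experiments.
tf_expression = [8.9, 9.4, 10.1, 11.0, 11.8]
target_expression = [9.8, 10.1, 11.2, 12.0, 12.9]
print(f"r = {pearson(tf_expression, target_expression):.3f}")
```

A high correlation is consistent with a regulatory interaction but does not establish one, so a value like this is best read alongside the plot and the curated condition metadata.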

UPDATE AND DATA SUBMISSION PROCEDURES

Raw CEL files are periodically collected from the ArrayExpress and GEO microarray databases. Upon accumulation of approximately 50 new chips for a particular species, all of the old and new microarrays are normalized together into a new compendium build. For researchers preferring to submit CEL files directly to M3D, we can generate a template submission to GEO, which the researcher can then edit as desired.

16 in total

1. Functional discovery via a compendium of expression profiles.

Authors: T R Hughes; M J Marton; A R Jones; C J Roberts; R Stoughton; C D Armour; H A Bennett; E Coffey; H Dai; Y D He; M J Kidd; A M King; M R Meyer; D Slade; P Y Lum; S B Stepaniants; D D Shoemaker; D Gachotte; K Chakraburtty; J Simon; M Bard; S H Friend
Journal: Cell Date: 2000-07-07 Impact factor: 41.582

2. Minimum information about a microarray experiment (MIAME)-toward standards for microarray data.

Authors: A Brazma; P Hingamp; J Quackenbush; G Sherlock; P Spellman; C Stoeckert; J Aach; W Ansorge; C A Ball; H C Causton; T Gaasterland; P Glenisson; F C Holstege; I F Kim; V Markowitz; J C Matese; H Parkinson; A Robinson; U Sarkans; S Schulze-Kremer; J Stewart; R Taylor; J Vilo; M Vingron
Journal: Nat Genet Date: 2001-12 Impact factor: 38.330

3. Network component analysis: reconstruction of regulatory signals in biological systems.

Authors: James C Liao; Riccardo Boscolo; Young-Lyeol Yang; Linh My Tran; Chiara Sabatti; Vwani P Roychowdhury
Journal: Proc Natl Acad Sci U S A Date: 2003-12-12 Impact factor: 11.205

4. Reverse engineering of regulatory networks in human B cells.

Authors: Katia Basso; Adam A Margolin; Gustavo Stolovitzky; Ulf Klein; Riccardo Dalla-Favera; Andrea Califano
Journal: Nat Genet Date: 2005-03-20 Impact factor: 38.330

5. ASAP, a systematic annotation package for community analysis of genomes.

Authors: Jeremy D Glasner; Paul Liss; Guy Plunkett; Aaron Darling; Tejasvini Prasad; Michael Rusch; Alexis Byrnes; Michael Gilson; Bryan Biehl; Frederick R Blattner; Nicole T Perna
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

6. RegulonDB (version 5.0): Escherichia coli K-12 transcriptional regulatory network, operon organization, and growth conditions.

Authors: Heladia Salgado; Socorro Gama-Castro; Martín Peralta-Gil; Edgar Díaz-Peredo; Fabiola Sánchez-Solano; Alberto Santos-Zavaleta; Irma Martínez-Flores; Verónica Jiménez-Jacinto; César Bonavides-Martínez; Juan Segura-Salazar; Agustino Martínez-Antonio; Julio Collado-Vides
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

7. Entrez Gene: gene-centered information at NCBI.

Authors: Donna Maglott; Jim Ostell; Kim D Pruitt; Tatiana Tatusova
Journal: Nucleic Acids Res Date: 2006-12-05 Impact factor: 16.971

8. Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles.

Authors: Jeremiah J Faith; Boris Hayete; Joshua T Thaden; Ilaria Mogno; Jamey Wierzbowski; Guillaume Cottarel; Simon Kasif; James J Collins; Timothy S Gardner
Journal: PLoS Biol Date: 2007-01 Impact factor: 8.029

9. EcoCyc: a comprehensive database resource for Escherichia coli.

Authors: Ingrid M Keseler; Julio Collado-Vides; Socorro Gama-Castro; John Ingraham; Suzanne Paley; Ian T Paulsen; Martín Peralta-Gil; Peter D Karp
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971

10. Chemogenomic profiling on a genome-wide scale using reverse-engineered gene networks.

Authors: Diego di Bernardo; Michael J Thompson; Timothy S Gardner; Sarah E Chobot; Erin L Eastwood; Andrew P Wojtovich; Sean J Elliott; Scott E Schaus; James J Collins
Journal: Nat Biotechnol Date: 2005-03 Impact factor: 54.908
