
NCBI GEO: mining tens of millions of expression profiles--database and tools update.

Tanya Barrett, Dennis B Troup, Stephen E Wilhite, Pierre Ledoux, Dmitry Rudnev, Carlos Evangelista, Irene F Kim, Alexandra Soboleva, Maxim Tomashevsky, Ron Edgar.

Abstract

The Gene Expression Omnibus (GEO) repository at the National Center for Biotechnology Information (NCBI) archives and freely disseminates microarray and other forms of high-throughput data generated by the scientific community. The database has a minimum information about a microarray experiment (MIAME)-compliant infrastructure that captures fully annotated raw and processed data. Several data deposit options and formats are supported, including web forms, spreadsheets, XML and Simple Omnibus Format in Text (SOFT). In addition to data storage, a collection of user-friendly web-based interfaces and applications are available to help users effectively explore, visualize and download the thousands of experiments and tens of millions of gene expression patterns stored in GEO. This paper provides a summary of the GEO database structure and user facilities, and describes recent enhancements to database design, performance, submission format options, data query and retrieval utilities. GEO is accessible at http://www.ncbi.nlm.nih.gov/geo/

Year: 2006 PMID: 17099226 PMCID: PMC1669752 DOI: 10.1093/nar/gkl887

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Microarray and other high-throughput technologies have led to an explosion in the rate of molecular abundance data generated in the last decade. For the last seven years the Gene Expression Omnibus (GEO) database has served as a central hub for these data, operating primarily as a public archive and distribution center, but also providing flexible mining tools that enable users to easily query, filter, inspect and download data in the context of their specific interests (1,2). GEO is currently the largest fully public gene expression resource. Since its inception, the database has grown exponentially each year. As of September 2006, the database holds over 120 000 samples, representing over 3.2 billion individual measurements, spanning over 200 organisms, and addressing a wide variety of biological phenomena. These data have been deposited by >2000 laboratories from around the world. All data are freely available online and via bulk FTP download. GEO supports minimum information about a microarray experiment (MIAME)-compliant data submissions. MIAME is a data content standard developed by the microarray gene expression data (MGED) society to outline what information should be provided when describing a microarray experiment (3). Making microarray data public in a MIAME-compliant manner has become a precondition for publication for many journals. Publishing original data and protocols facilitates independent evaluation of results and reanalysis, and is in keeping with the spirit of open-access (4). Consequently, most of the data in GEO have been submitted by the research community in fulfillment of journal requirements.

DATABASE STRUCTURE AND DATA FLOW

The GEO database architecture is designed for the efficient capture, storage and retrieval of large-scale functional genomic data. The diverse and complex nature of such data presents considerable challenges in data handling and querying. There are many different types of high-throughput methodologies and researchers use a wide variety of hardware and software to generate and process data. Thus, data come in many different formats and comprise varying content. Furthermore, technologies and processing strategies continue to evolve rapidly. In light of these considerations, GEO was designed with a flexible structure that can accommodate diverse styles of data. This flexibility is largely attributed to the fact that tabular data are not fully granulated in the core database but instead are treated as plain text, tab-delimited tables that may contain any number of rows or columns. Although the primary database has no knowledge of, and applies no restrictions on, these tab-delimited tables, some columns reserve special meanings and data from selected fields are extracted to secondary databases and used in downstream query and analysis applications. Accompanying supplementary and native file types are linked from each record and stored on an FTP server.

Expression data can be rendered meaningless unless accompanied by the contextual biological and processing details under which they were generated. To address this, GEO has a MIAME-compliant infrastructure that supports fully annotated records. Biological and other descriptive metadata are stored in designated fields with proper relations or restrictions within database tables.
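The idea of storing a table as free-form tab-delimited text while pulling out only a few reserved columns can be illustrated with a minimal sketch. The column names `ID_REF` and `VALUE`, the function name and the example rows are assumptions made for illustration; the text above says only that some columns reserve special meanings.

```python
import csv
import io

# Assumed reserved column names, for illustration only; the article states
# that some columns carry special meaning but does not list them here.
RESERVED_COLUMNS = ("ID_REF", "VALUE")


def extract_reserved(table_text, reserved=RESERVED_COLUMNS):
    """Return (header, extracted_rows) from a tab-delimited text table.

    The table may hold any number of columns; only the reserved ones that
    are actually present are copied into the extracted rows.
    """
    reader = csv.reader(io.StringIO(table_text), delimiter="\t")
    header = next(reader)
    # Map each reserved column that is present to its position in the header.
    positions = {name: header.index(name) for name in reserved if name in header}
    extracted = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        extracted.append(
            {name: row[i] for name, i in positions.items() if i < len(row)}
        )
    return header, extracted


# Hypothetical table with one extra, unreserved column.
example = "ID_REF\tVALUE\tABS_CALL\n1007_s_at\t512.3\tP\n1053_at\t87.1\tA\n"
header, extracted = extract_reserved(example)
```

The unreserved `ABS_CALL` column stays in the stored text untouched, which mirrors the design choice described above: the core store imposes no schema, and only downstream consumers interpret selected fields.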

Submitter-supplied data

The overall structure of the core GEO database remains as described previously (1,2). Briefly, data submitted to GEO are stored in a relational MSSQL database partitioned into three entity types:

Platform

Includes a summary description of the array and a data table defining the array template. Each row in the table corresponds to a single feature, and includes sequence annotation and tracking information as provided by the submitter. The table may contain any number of columns allowing thorough annotation of the array.

Sample

Includes a description of the biological material and the experimental protocols to which it was subjected, and a data table containing hybridization measurements for each feature on the corresponding platform. The table may contain any number of columns in which to comprehensively present hybridization results. The metadata fields may hold very large volumes of text to allow elaborate descriptions of the biological source and protocols.

Series

Defines a set of related samples considered to be part of a study, and describes the overall study aim and design. Series may also incorporate summary tables pertaining to the experiment as a whole.

Each of these objects is essentially under the submitter's editorial control and is assigned a stable and unique accession number that may be used to cite and retrieve the records. The accession consists of a letter prefix indicating whether the record is a GEO Platform (GPL), GEO Sample (GSM), or GEO Series (GSE), followed by a number.

In addition to the user-submitted objects described above, GEO defines and creates a number of related data objects to facilitate data mining, visual rendering and transposition of submitted data into alternative structures. The principal object used for this purpose is the DataSet object.
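The accession convention lends itself to a small validation helper. The following is a minimal sketch rather than any official GEO utility; the function name is hypothetical, and the GDS prefix for DataSets is taken from the accession GDS877 cited in the Figure 1 caption.

```python
import re

# Prefix-to-record-type mapping as described in the text; GDS is inferred
# from the DataSet accession GDS877 cited in the Figure 1 caption.
RECORD_TYPES = {
    "GPL": "Platform",
    "GSM": "Sample",
    "GSE": "Series",
    "GDS": "DataSet",
}

_ACCESSION = re.compile(r"^(GPL|GSM|GSE|GDS)(\d+)$")


def classify_accession(accession):
    """Return (record_type, number) for an accession such as 'GSE1234'."""
    match = _ACCESSION.match(accession.strip().upper())
    if match is None:
        raise ValueError(f"not a recognized GEO accession: {accession!r}")
    prefix, number = match.groups()
    return RECORD_TYPES[prefix], int(number)


record_type, number = classify_accession("GDS877")
```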

GEO DataSets

Despite the variety of style and content of the data received, submissions have a common core set of elements: sequence identity tracking information of each feature on the array; normalized hybridization measurements; and a description of the biological source used in each hybridization.

Using a combination of automated data extraction and manual curation, this information is taken from the submitter-supplied records and organized into an upper-level object called a GEO DataSet. A DataSet represents a collection of similarly-processed experimentally-related samples, summarized and categorized according to experimental variables. DataSets allow for the transformation of diverse styles of incoming data from multiple unrelated projects, into a relatively standardized format upon which downstream data analysis and data display tools are based.

DataSets provide two discrete renderings of the data (Figure 1):

Figure 1

A selection of GEO screenshots from a typical experiment (GEO DataSet GDS877; 16). (A) DataSet record includes experiment summary information, DataSet subset classifications, and access to data mining features such as hierarchical cluster heat map and ‘Query subset A versus B’ tool. (B) DataSet hierarchical cluster heat map calculated by un-centered correlation coefficient/average linkage option. Regions of interest can be selected using the red image cropper box, then either expanded to view sample and gene annotation, downloaded, charted as line plots, or linked directly to corresponding Entrez GEO profiles records. (C) GEO profiles retrieval results; each entity includes sequence identifier and DataSet information, and a thumbnail profile image. (D) Expanded profile chart depicts expression value information for the crystallin gene across each sample in DataSet GDS877. Experimental subset groupings are reflected in labels at foot of chart.

An experiment-centered representation that encapsulates the entire study. This information is presented as a DataSet record which comprises a synopsis of the experiment, a breakdown of the experimental variables, access to auxiliary objects, several data display and analysis tools, and download options.

A gene-centered representation that presents quantitative gene expression measurements for one gene across a DataSet. This information is presented as a GEO Profile which comprises gene identity annotation, DataSet title, links to auxiliary information and a chart depicting the expression level and rank of that gene across each sample in the DataSet. Gene annotation is derived from querying sequence identifiers (e.g. GenBank accessions, clone IDs) with the latest Entrez Gene and UniGene databases, an important point given the dynamic nature of gene annotation.
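A gene-centered profile of this kind can be sketched as a walk across samples that records the gene's value and its rank within each sample. The article does not say how GEO computes rank, so the percentile definition used here (share of features in a sample whose value does not exceed the gene's value) is an assumption, as are the function name and the toy data.

```python
def gene_profile(expression, gene):
    """Return [(sample, value, percentile_rank), ...] for one gene.

    `expression` maps sample name -> {feature name -> value}. The percentile
    rank is an assumed definition: the percentage of features in the sample
    whose value is at or below the gene's value.
    """
    profile = []
    for sample, values in expression.items():
        value = values[gene]
        at_or_below = sum(1 for v in values.values() if v <= value)
        rank = 100.0 * at_or_below / len(values)
        profile.append((sample, value, rank))
    return profile


# Hypothetical two-sample data set with four features.
expression = {
    "sample_A": {"g1": 10.0, "g2": 50.0, "g3": 30.0, "g4": 20.0},
    "sample_B": {"g1": 40.0, "g2": 5.0, "g3": 30.0, "g4": 20.0},
}
profile = gene_profile(expression, "g1")
```

Reporting rank alongside the raw value makes profiles from differently scaled samples easier to compare, which is presumably why both appear on the profile chart.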

SUBMISSION PROCEDURES, FORMATS AND STANDARDS

We endeavor to make data deposit procedures as straightforward as possible. Submitters have several options for data submission; selecting which method to use depends on the amount and type of data to be submitted, and what format the data are already in. Regardless of the deposit method chosen, the final GEO records will look similar and contain equivalent information. Each format captures all components of the MIAME checklist, as well as any additional information that the submitter wants to provide.

Upload options and formats

Web deposit

The web submission process is designed for quick and easy deposit of individual records by occasional submitters, or for smaller experiments. This route consists of a set of interactive web forms that provide a simple step-by-step procedure for deposit of data tables and accompanying descriptive information.

SOFT format

Simple Omnibus Format in Text (SOFT) is a simple, line-based, tab-delimited format designed for rapid batch deposit. A single SOFT file can hold both data tables and accompanying descriptive information for multiple platforms, samples and series records. The simplicity of SOFT allows it to be readily generated from commonly-used database and spreadsheet applications. Conveniently, two versions of SOFT are available:

SOFTtext

SOFT-formatted data are organized as concatenated records.

SOFTmatrix

SOFT-formatted data are organized side-by-side as a matrix table, usually in an Excel spreadsheet.
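Because SOFTtext is line-based, a reader for concatenated records can be very small. The sketch below is illustrative only: the line prefixes (`^` for a new entity, `!` for an attribute, `#` for a column description) and the `!..._table_begin` / `!..._table_end` markers are assumptions about SOFT syntax that this article does not spell out, and the record contents are invented.

```python
def parse_soft(text):
    """Parse SOFTtext-style content into a list of record dictionaries."""
    records = []
    current = None
    in_table = False
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("^"):
            # Entity line, e.g. "^SAMPLE = name": starts a new record.
            entity, _, name = line[1:].partition("=")
            current = {
                "entity": entity.strip(),
                "name": name.strip(),
                "attributes": {},
                "columns": {},
                "table": [],
            }
            records.append(current)
            in_table = False
        elif current is None:
            continue  # ignore anything before the first entity line
        elif line.startswith("!"):
            key, _, value = line[1:].partition("=")
            key = key.strip()
            if key.lower().endswith("_table_begin"):
                in_table = True
            elif key.lower().endswith("_table_end"):
                in_table = False
            else:
                # Attributes may repeat, so keep every value in a list.
                current["attributes"].setdefault(key, []).append(value.strip())
        elif line.startswith("#"):
            key, _, value = line[1:].partition("=")
            current["columns"][key.strip()] = value.strip()
        elif in_table:
            current["table"].append(line.split("\t"))
    return records


# Hypothetical two-record file; names and values are invented.
SOFT_EXAMPLE = "\n".join([
    "^SAMPLE = example_sample_1",
    "!Sample_title = control, replicate 1",
    "#ID_REF = feature identifier",
    "#VALUE = normalized signal",
    "!sample_table_begin",
    "ID_REF\tVALUE",
    "feature_1\t512.3",
    "!sample_table_end",
    "^SAMPLE = example_sample_2",
    "!Sample_title = treated, replicate 1",
])
records = parse_soft(SOFT_EXAMPLE)
```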

MINiML format

MIAME Notation in Markup Language, (MINiML, pronounced ‘minimal’) is a recent addition to GEO's upload/download options. MINiML is effectively an XML rendering of SOFT format, and is similarly designed for rapid batch submission and upload of data. The MINiML XML schema definition and a detailed description are available at the GEO website.

MAGE-ML format

MicroArray Gene Expression Markup Language (MAGE-ML) is an XML format devised by the MGED consortium (5) and a direct derivation from the corresponding MAGE object model. GEO is not based on the MAGE object model and cannot receive these files directly. Nonetheless, parsers have been written to extract data from some of the various flavors of MAGE-ML and reformat according to GEO schema. It is worth noting here that having data formatted as MAGE-ML does not in any way imply MIAME-compliance. MIAME is a data content standard, not a format standard. MIAME-compliant data may be submitted in many formats.

Detailed documentation and examples of submission options and formats are available on the GEO website. However, if submitters have questions or require assistance with submission procedures they are encouraged to contact GEO curation staff at geo@ncbi.nlm.nih.gov for prompt support.

Submitters may keep their records private until a manuscript describing the data is published. Submitters may generate read-only passwords that give reviewers and collaborators confidential access to their private data.

Most researchers submit to GEO to support data discussed in a journal manuscript, so it is important to present the data as they were processed in the manuscript. However, over the past two years, greater emphasis has been placed on provision of raw, unmanipulated native data files to accompany the processed data within GEO records. Such files include, e.g. Affymetrix CEL or GenePix GPR scan files. Recent modifications to submission procedures now make it more convenient for submitters to supply these raw files: the web deposit route specifically requests supplementary files; the batch deposit routes allow for raw data files to be zipped/tarred together with bulk submissions. Provision of raw data not only enables other researchers to faithfully reproduce the data selection, transformation and analysis steps that are the basis of a publication, but also maximizes the long-term value of submissions, enabling recycling of the data into repeated rounds of analysis.

All submitted data undergo syntactic validation and are inspected by curators for content integrity. When content or format problems are identified, curators work with the submitter until the issue is resolved. However, given the huge diversity of biological themes, technology types, processing techniques, and statistical transformations applied to microarray data, it is impractical for curators to decisively determine the accuracy or validity of submitted data, or to score their degree of MIAME-compliance. Thus, researchers are ultimately responsible for the completeness, quality and accuracy of their submissions. This validation process can benefit from feedback by journal editorial reviewers or funding agency enforcement. Through their GEO accounts, researchers retain full editorial control of their records and can update or edit their records at any time.

In addition to satisfying possible journal requirements for publication, there are other significant benefits to depositing data with GEO. Data receive long-term archiving at a centralized repository and integration with other NCBI resources, which affords greatly increased usability and visibility, as well as possible links back to submitters' own project websites.

TOOLS TO RETRIEVE, EXPLORE AND VISUALIZE DATA

To maximize the utility and value of the massive volumes of data in GEO, a selection of intuitive tools and features has been developed to assist researchers to quickly locate, analyze and visualize data relevant to their interests. These features incorporate traditional data reduction techniques and concise displays designed for human scanning, helping the user identify and categorize gene and sample relationships. Figure 2 depicts a schematic overview of the query workflow and how the various features and tools are interlinked. A summary of where the main features are located and their purpose is provided in Table 1. Query approaches include standard text-based searches, sequence-based searches, mining based on expression behavior characteristics or combinations of these factors.

Figure 2

A schematic overview of query workflow, and how various features and tools are interlinked. A description of the location and purpose of these features is provided in Table 1.

Table 1

Summary of location and purpose of various GEO data mining tools and features

These tools do not require specialized knowledge of microarray analysis methods, nor do they require time-consuming download or processing of large data sets. However, it should be stated that the analysis features are not primarily intended for robust systematic data mining. The diverse nature of the data in GEO restricts to some extent the statistical tools that can be developed. All data are treated similarly; criteria such as scaling factors, filter parameters, and number of repeats are not considered. Despite these issues, these tools are extremely useful for quick and easy identification of relevant and noteworthy data.

NCBI's Entrez search system serves as the basis for most queries. Entrez GEO DataSets contains experiment-centered data and Entrez GEO Profiles contains gene-centered data. Most biologists are familiar with Entrez, using it routinely to search other NCBI databases like PubMed and GenBank (6,7). It has a straightforward interface where users can locate relevant material by simply typing in keywords or Boolean phrases restricted to supported attribute fields. Examples of typical queries and query fields are provided at . Full use is made of Entrez's powerful linking capabilities. Intra-database links connect genes related by expression pattern or sequence. Where possible, reciprocal inter-database links connect GEO data with related data in other NCBI resources such as PubMed, GenBank, Gene, UniGene, MapViewer, OMIM and others. Advanced Entrez features allow generation of complex multipart queries or combination of multiple queries that find common intersections in retrievals.
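A Boolean, field-restricted Entrez query of the kind described above can also be assembled programmatically. The sketch below only builds the query string and a search URL; it sends no request. The E-utilities `esearch.fcgi` endpoint, the `gds` database name, the `[Organism]` field tag and both function names are assumptions that do not come from this article, so they should be checked against current NCBI documentation before use.

```python
from urllib.parse import urlencode

# Assumed NCBI E-utilities search endpoint; not given in the article.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(clauses, operator="AND"):
    """Join (text, field) pairs into a Boolean Entrez-style query string.

    A field of None leaves the text unrestricted; otherwise the text is
    tagged as text[field].
    """
    parts = []
    for text, field in clauses:
        parts.append(f"{text}[{field}]" if field else text)
    return f" {operator} ".join(parts)


def build_search_url(database, term, retmax=20):
    """Return a URL-encoded search URL for the given database and term."""
    query = urlencode({"db": database, "term": term, "retmax": retmax})
    return f"{EUTILS_ESEARCH}?{query}"


# Hypothetical query: a keyword restricted to one organism.
term = build_term([("crystallin", None), ("Homo sapiens", "Organism")])
url = build_search_url("gds", term)
```

Keeping term construction separate from URL construction lets the same Boolean phrase be pasted into the web interface or reused against a different database name.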
GEO's Entrez query facilities were recently further enhanced by implementation of a spell-check function, as well as automatic term mapping using MeSH translation tables.

Graphics are an important tool to aid visualization and interpretation of high-dimensional expression data. The expression pattern of each gene within a DataSet is represented as a profile chart (Figure 1D). A breakdown of the experimental design is provided along the bottom of the chart, helping the user to quickly assess whether expression levels are shifting with experimental variables. Thumbnail chart images provided on batch profile retrievals are useful for rapid batch profile scanning and comparison. Value distribution charts are provided on DataSets records, providing at-a-glance indication of how well normalized the data are within a DataSet. Precomputed interactive hierarchical cluster heat map images are available on each DataSet record, providing suggestions for groups of coordinately regulated genes within entire DataSets.

Within the last year, the back-end structure of the Profiles, DataSets and annotation databases was completely redesigned. These changes allow more flexibility on the front-end user interfaces and will permit development of more advanced query, analysis and download tools, including enhanced Entrez utilities user-scripting options. These changes also help to streamline internal indexing procedures, enabling more frequent release of new DataSets and profiles.

For users who prefer to use their own analysis software or want to perform more robust analyses, all GEO data are available for bulk download via anonymous FTP at . Files include SOFT- and MINiML-formatted Platform and Series families, SOFT-formatted DataSets and original supplementary data types. Various software packages have been developed by the community to handle GEO data formats, including the GEOquery R/BioConductor package, .
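The Figure 1B caption names the un-centered correlation coefficient with average linkage as one clustering option for the heat maps. As a rough illustration of what those two terms mean, and not a description of GEO's own implementation, the sketch below uses the textbook definitions: un-centered correlation is the dot product of two profiles divided by the product of their norms (no mean subtraction), and average linkage is the mean pairwise distance between members of two clusters. All names and data are hypothetical.

```python
import math


def uncentered_correlation(x, y):
    """Correlation without mean-centering (equivalent to cosine similarity)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("an all-zero profile has no defined correlation")
    return dot / (norm_x * norm_y)


def average_linkage(cluster_a, cluster_b, profiles):
    """Mean pairwise distance (1 - correlation) between two clusters."""
    distances = [
        1.0 - uncentered_correlation(profiles[i], profiles[j])
        for i in cluster_a
        for j in cluster_b
    ]
    return sum(distances) / len(distances)


def cluster(profiles):
    """Naive agglomerative clustering; returns the merges in order."""
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest average-linkage distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j], profiles)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical profiles: geneB is an exact multiple of geneA.
profiles = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [2.0, 4.0, 6.0],
    "geneC": [3.0, 1.0, -2.0],
}
merges = cluster(profiles)
```

In this toy input geneA and geneB merge first because scaling a profile leaves its un-centered correlation unchanged. The exhaustive pair search is written for clarity and would be too slow for real DataSets.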

CONCLUSIONS

GEO currently represents the largest single resource for public gene expression data. Beyond archiving and making data freely available for peer review and download, the GEO repository also provides an extensive complement of utilities and strategies that enable effective data mining on either a small or large scale. The data in GEO gain value as they accumulate. Pooling masses of expression data into common formats at a single location affords researchers the opportunity to distill disparate data sets and identify common gene expression trends, dissect regulatory networks and predict functions of uncharacterized genes. Increasingly, GEO data are used and cited by third parties as evidence to support and complement their own studies; selected examples include (8–15). Having GEO data cross-annotated with extensive sequence, mapping and bibliographic resources via the NCBI Entrez system of interlinked databases imparts further value and context to the data. This diverse integrated data environment leverages multiple types of information and enables traditional disciplinary boundaries to be crossed, ultimately accelerating systems-level hypothesis formation and scientific discovery.

Future plans for GEO include continued development of data retrieval and mining features, and enhancing novice user experience. We also plan to improve rendering and representation of the non-gene-expression data types that GEO accepts, which include chromatin-immunoprecipitation on arrays (ChIP-chip) studies, array comparative genomic hybridization (aCGH), SNP arrays and some proteomic data.

16 in total

1. Minimum information about a microarray experiment (MIAME)-toward standards for microarray data.

Authors: A Brazma; P Hingamp; J Quackenbush; G Sherlock; P Spellman; C Stoeckert; J Aach; W Ansorge; C A Ball; H C Causton; T Gaasterland; P Glenisson; F C Holstege; I F Kim; V Markowitz; J C Matese; H Parkinson; A Robinson; U Sarkans; S Schulze-Kremer; J Stewart; R Taylor; J Vilo; M Vingron
Journal: Nat Genet Date: 2001-12 Impact factor: 38.330

2. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository.

Authors: Ron Edgar; Michael Domrachev; Alex E Lash
Journal: Nucleic Acids Res Date: 2002-01-01 Impact factor: 16.971

3. Freshly isolated rat alveolar type I cells, type II cells, and cultured type II cells have distinct molecular phenotypes.

Authors: Robert Gonzalez; Yee Hwa Yang; Chandi Griffin; Lennell Allen; Zachary Tigue; Leland Dobbs
Journal: Am J Physiol Lung Cell Mol Physiol Date: 2004-09-24 Impact factor: 5.464

4. Functional annotation and network reconstruction through cross-platform integration of microarray data.

Authors: Xianghong Jasmine Zhou; Ming-Chih J Kao; Haiyan Huang; Angela Wong; Juan Nunez-Iglesias; Michael Primig; Oscar M Aparicio; Caleb E Finch; Todd E Morgan; Wing Hung Wong
Journal: Nat Biotechnol Date: 2005-01-16 Impact factor: 54.908

5. Assessment and integration of publicly available SAGE, cDNA microarray, and oligonucleotide microarray expression data for global coexpression analyses.

Authors: Obi L Griffith; Erin D Pleasance; Debra L Fulton; Mehrdad Oveisi; Martin Ester; Asim S Siddiqui; Steven J M Jones
Journal: Genomics Date: 2005-10 Impact factor: 5.736

6. Entrez: molecular biology database and retrieval system.

Authors: G D Schuler; J A Epstein; H Ohkawa; J A Kans
Journal: Methods Enzymol Date: 1996 Impact factor: 1.600

7. Standards for microarray data: an open letter.

Authors: Catherine Ball; Alvis Brazma; Helen Causton; Steve Chervitz; Ron Edgar; Pascal Hingamp; John C Matese; Helen Parkinson; John Quackenbush; Martin Ringwald; Susanna-Assunta Sansone; Gavin Sherlock; Paul Spellman; Christian Stoeckert; Yoshio Tateno; Ronald Taylor; Joseph White; Neil Winegarden
Journal: Environ Health Perspect Date: 2004-08 Impact factor: 9.031

8. Database resources of the National Center for Biotechnology Information.

Authors: David L Wheeler; Tanya Barrett; Dennis A Benson; Stephen H Bryant; Kathi Canese; Vyacheslav Chetvernin; Deanna M Church; Michael DiCuccio; Ron Edgar; Scott Federhen; Lewis Y Geer; Wolfgang Helmberg; Yuri Kapustin; David L Kenton; Oleg Khovayko; David J Lipman; Thomas L Madden; Donna R Maglott; James Ostell; Kim D Pruitt; Gregory D Schuler; Lynn M Schriml; Edwin Sequeira; Stephen T Sherry; Karl Sirotkin; Alexandre Souvorov; Grigory Starchenko; Tugba O Suzek; Roman Tatusov; Tatiana A Tatusova; Lukas Wagner; Eugene Yaschenko
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

9. NCBI GEO: mining millions of expression profiles--database and tools.

Authors: Tanya Barrett; Tugba O Suzek; Dennis B Troup; Stephen E Wilhite; Wing-Chi Ngau; Pierre Ledoux; Dmitry Rudnev; Alex E Lash; Wataru Fujibuchi; Ron Edgar
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971

10. Design and implementation of microarray gene expression markup language (MAGE-ML).

Authors: Paul T Spellman; Michael Miller; Jason Stewart; Charles Troup; Ugis Sarkans; Steve Chervitz; Derek Bernhart; Gavin Sherlock; Catherine Ball; Marc Lepage; Marcin Swiatek; W L Marks; Jason Goncalves; Scott Markel; Daniel Iordan; Mohammadreza Shojatalab; Angel Pizarro; Joe White; Robert Hubley; Eric Deutsch; Martin Senger; Bruce J Aronow; Alan Robinson; Doug Bassett; Christian J Stoeckert; Alvis Brazma
Journal: Genome Biol Date: 2002-08-23 Impact factor: 13.583
