GEOmetadb: powerful alternative search engine for the Gene Expression Omnibus.

Yuelin Zhu, Sean Davis, Robert Stephens, Paul S Meltzer, Yidong Chen.

Abstract

The NCBI Gene Expression Omnibus (GEO) represents the largest public repository of microarray data. However, finding data in GEO can be challenging. We have developed GEOmetadb in an attempt to make querying the GEO metadata both easier and more powerful. All GEO metadata records, as well as the relationships between them, are parsed and stored in a local MySQL database. A powerful, flexible web search interface with several convenient utilities provides query capabilities not available via NCBI tools. In addition, a Bioconductor package, GEOmetadb, that uses a SQLite export of the entire GEOmetadb database is also available, making the entire GEO metadata accessible with the full power of SQL-based queries from within R.

AVAILABILITY: The web interface and SQLite databases are available at http://gbnci.abcc.ncifcrf.gov/geo/. The Bioconductor package is available via the Bioconductor project. The corresponding MATLAB implementation is also available at the same website.

Year: 2008; PMID: 18842599; PMCID: PMC2639278; DOI: 10.1093/bioinformatics/btn520

Source DB: PubMed; Journal: Bioinformatics; ISSN: 1367-4803; Impact factor: 6.937


1 INTRODUCTION

The NCBI Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo/) represents the largest public repository of microarray data in existence (Edgar et al., 2002; Barrett et al., 2007). The Bioconductor project (Gentleman et al., 2004) contains hundreds of state-of-the-art methods for the analysis of microarray and genomics data. We previously published GEOquery (Davis and Meltzer, 2007), software that bridges GEO microarray data and Bioconductor and facilitates reanalysis using novel and rigorous statistical and bioinformatic tools. However, one difficulty remains: using the experimental metadata to find the microarray data of interest, especially for large-scale and programmatic access to GEO data. As part of the NCBI Entrez search system, GEO can be searched online via web pages or with NCBI eUtils (http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html). The NCBI/GEO web search is not yet full-featured, particularly for programmatic access. NCBI eUtils offers another option for finding data within the vast stores of GEO, but it is cumbersome to use and often requires multiple complicated queries to retrieve the relevant information.

GEOmetadb was developed in an attempt to make querying the GEO metadata both easier and more effective. It has two components: a web-based query engine with several convenient utilities, supported by a MySQL database backend; and a Bioconductor package, also called GEOmetadb, which queries a locally installed GEOmetadb SQLite database that we update regularly and supply for download. Each can be used independently of the other.

2 RESULTS

2.1 GEO metadata parsing

GEO has an open, adaptable design that can handle variety, and a Minimum Information About a Microarray Experiment (MIAME)-compliant (Brazma et al., 2001) infrastructure that promotes fully annotated submissions. The basic record types in GEO are Platforms (GPL), Samples (GSM), Series (GSE) and DataSets (GDS); GDS records are assembled by GEO curators and the others are supplied by submitters. The information in each GEO record can be divided into two parts, a metadata part and a data part. The metadata part is critical for finding GEO microarray data of interest. NCBI offers several different methods to access GEO records, which we use to capture all GEO metadata for the different GEO data types. Hypertext preprocessor (PHP, http://www.php.net) functions were written to parse, extract, reformat and construct data elements and to interact with a MySQL database (http://www.mysql.com/) for storage and querying. The PHP function for parsing GDS SOFT files was adapted from the EzArray software (Zhu et al., 2008). The GEOmetadb MySQL database was designed to store parsed GEO metadata and the relationships between them (Fig. 1). All data in GEOmetadb are faithfully parsed from GEO, and no attempt is made to curate, semantically recode or otherwise clean up GEO data. All field names are also taken from GEO records, except for minor changes to improve usability in SQL queries. Fields containing multiple records are generally stored as delimited text within the same record; this denormalization significantly reduces the complexity and improves the efficiency of queries.

SQLite 3 (http://www.sqlite.org/) is a widely used, cross-platform SQL database engine that is self-contained, embeddable, serverless and transactional. The RSQLite package (James, 2008) includes an embedded SQLite database engine and can interact with any SQLite database; each database exists as a simple file, which is easily exchanged and is platform independent. An R script converts the GEOmetadb MySQL database to an SQLite 3 database file that contains data identical to those in the GEOmetadb MySQL database. The SQLite version of GEOmetadb is maintained and distributed for local installation.
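The storage choices described above can be illustrated with a small sketch. It is not the GEOmetadb schema or its PHP/R tooling: the table and column names, the `;\t` delimiter and the `GSE_X1`-style accessions are all assumptions made up for illustration, and Python's built-in `sqlite3` module stands in for RSQLite. The sketch shows a relationship table linking series to samples, a multi-valued field kept as delimited text in one row, and the whole database living in a single file.

```python
import os
import sqlite3
import tempfile

DELIM = ";\t"  # assumed delimiter for multi-valued fields (illustrative only)

SCHEMA = """
CREATE TABLE gse (gse TEXT PRIMARY KEY, title TEXT, pubmed_id TEXT);
CREATE TABLE gpl (gpl TEXT PRIMARY KEY, title TEXT);
CREATE TABLE gsm (gsm TEXT PRIMARY KEY, gpl TEXT, characteristics_ch1 TEXT);
CREATE TABLE gse_gsm (gse TEXT, gsm TEXT);  -- series-to-sample relationship
"""

def build_toy_db(path):
    """Create a small SQLite file with made-up records (not real GEO data)."""
    con = sqlite3.connect(path)
    con.executescript(SCHEMA)
    # Two values for one field are stored as delimited text in a single row.
    con.execute("INSERT INTO gse VALUES ('GSE_X1', 'toy series', ?)",
                (DELIM.join(["1001", "1002"]),))
    con.execute("INSERT INTO gpl VALUES ('GPL_X1', 'toy platform')")
    con.executemany("INSERT INTO gsm VALUES (?, ?, ?)", [
        ("GSM_X1", "GPL_X1", DELIM.join(["tissue: liver", "age: 40"])),
        ("GSM_X2", "GPL_X1", DELIM.join(["tissue: lung", "age: 52"])),
    ])
    con.executemany("INSERT INTO gse_gsm VALUES (?, ?)",
                    [("GSE_X1", "GSM_X1"), ("GSE_X1", "GSM_X2")])
    con.commit()
    con.close()

def split_field(value):
    """Split a delimited multi-valued field back into its items."""
    return value.split(DELIM) if value else []

def samples_matching(con, keyword):
    """Find samples whose delimited characteristics mention a keyword."""
    rows = con.execute(
        "SELECT gsm FROM gsm WHERE characteristics_ch1 LIKE ? ORDER BY gsm",
        (f"%{keyword}%",),
    )
    return [r[0] for r in rows]

if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "toy_geometadb.sqlite")
    build_toy_db(path)            # the whole database is this one file
    con = sqlite3.connect(path)   # reopen it, as any SQLite client could
    print(samples_matching(con, "liver"))
    print(split_field(con.execute("SELECT pubmed_id FROM gse").fetchone()[0]))
    con.close()
```

Keeping multi-valued fields in one text column means a keyword search needs only a single `LIKE` condition rather than an extra join, which is the trade-off the denormalization accepts.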

Fig. 1. Diagram of GEO entity relationships in GEOmetadb.

2.2 GEOmetadb Bioconductor package

The GEOmetadb Bioconductor package is a thin wrapper around the GEOmetadb SQLite database. The package also includes extensive documentation and example queries. The function getSQLiteFile is the standard method for downloading and unzipping the most recent GEOmetadb SQLite file from the server. The function geoConvert converts one GEO entity to its associated GEO entities, providing a very fast, convenient mapping between GEO types; for example, it can convert ‘GPL96’ to the other possible GEO entities in GEOmetadb.sqlite. In addition, the RSQLite function dbGetQuery can run SQL against the database, for example to extract all Affymetrix GeneChips that have a .CEL supplementary submission to GEO.
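The two kinds of query mentioned above (entity conversion and a supplementary-file search) can be sketched against a toy database. This is a hedged illustration in Python, not the package's R code: `convert_gpl` and `affymetrix_cel_samples` are hypothetical helpers, the table and column names are assumptions, and the accessions and file URLs are invented placeholders rather than real GEO records.

```python
import sqlite3

def make_toy_db():
    """Build an in-memory database with made-up records (assumed schema)."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE gpl (gpl TEXT, manufacturer TEXT);
        CREATE TABLE gsm (gsm TEXT, gpl TEXT, supplementary_file TEXT);
        CREATE TABLE gse_gpl (gse TEXT, gpl TEXT);
        CREATE TABLE gse_gsm (gse TEXT, gsm TEXT);
    """)
    con.executemany("INSERT INTO gpl VALUES (?, ?)",
                    [("GPL_A", "Affymetrix"), ("GPL_B", "OtherVendor")])
    con.executemany("INSERT INTO gsm VALUES (?, ?, ?)", [
        ("GSM_1", "GPL_A", "ftp://example.org/GSM_1.CEL.gz"),
        ("GSM_2", "GPL_A", None),
        ("GSM_3", "GPL_B", "ftp://example.org/GSM_3.txt.gz"),
    ])
    con.executemany("INSERT INTO gse_gpl VALUES (?, ?)",
                    [("GSE_1", "GPL_A"), ("GSE_2", "GPL_B")])
    con.executemany("INSERT INTO gse_gsm VALUES (?, ?)",
                    [("GSE_1", "GSM_1"), ("GSE_1", "GSM_2"), ("GSE_2", "GSM_3")])
    return con

def convert_gpl(con, gpl):
    """Map one platform accession to its associated series and samples."""
    gse = [r[0] for r in con.execute(
        "SELECT DISTINCT gse FROM gse_gpl WHERE gpl = ? ORDER BY gse", (gpl,))]
    gsm = [r[0] for r in con.execute(
        "SELECT gsm FROM gsm WHERE gpl = ? ORDER BY gsm", (gpl,))]
    return {"gse": gse, "gsm": gsm}

def affymetrix_cel_samples(con):
    """Samples on Affymetrix platforms whose supplementary file is a CEL file."""
    rows = con.execute("""
        SELECT gsm.gsm, gsm.gpl, gsm.supplementary_file
        FROM gsm JOIN gpl ON gsm.gpl = gpl.gpl
        WHERE gpl.manufacturer = 'Affymetrix'
          AND gsm.supplementary_file LIKE '%CEL.gz'
        ORDER BY gsm.gsm
    """)
    return rows.fetchall()

if __name__ == "__main__":
    con = make_toy_db()
    print(convert_gpl(con, "GPL_A"))
    print(affymetrix_cel_samples(con))
```

In R, the same SQL string would be passed to dbGetQuery on a connection to the downloaded SQLite file; the join between the sample and platform tables is what lets a platform attribute (manufacturer) filter sample-level records.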

2.3 The GEOmetadb online search tool

The GEOmetadb online search tool is a web-based interface for searching, viewing and downloading GEO metadata stored in the GEOmetadb MySQL database. GEO metadata records can be searched by individual data type or by a flexible, efficient combined GSE-GPL-GSM search, as shown in Figure 2, in which GEO entities in the tables are linked by the relationships between them. Essential query fields are provided with drop-down menus for popular entries, and a keyword search performs full-text querying across multiple text fields in GEO. Other features include multiple-field queries, querying within results, creating lists, flexible display options, downloading and detailed views of any results.
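A combined search of this kind can be pictured as a single SQL join across the series, sample and platform tables with a keyword matched against several text fields. The sketch below is only an illustration under assumed table and column names, with invented records; it does not reproduce the web tool's PHP/MySQL implementation, and `combined_search` is a hypothetical helper.

```python
import sqlite3

def make_toy_db():
    """Build an in-memory database with made-up records (assumed schema)."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE gse (gse TEXT, title TEXT, summary TEXT);
        CREATE TABLE gpl (gpl TEXT, title TEXT, organism TEXT);
        CREATE TABLE gsm (gsm TEXT, gpl TEXT, title TEXT, source_name_ch1 TEXT);
        CREATE TABLE gse_gsm (gse TEXT, gsm TEXT);
    """)
    con.executemany("INSERT INTO gse VALUES (?, ?, ?)", [
        ("GSE_1", "toy breast tumour study", "expression profiling"),
        ("GSE_2", "toy liver study", "time course"),
    ])
    con.executemany("INSERT INTO gpl VALUES (?, ?, ?)", [
        ("GPL_A", "toy array A", "Homo sapiens"),
        ("GPL_B", "toy array B", "Mus musculus"),
    ])
    con.executemany("INSERT INTO gsm VALUES (?, ?, ?, ?)", [
        ("GSM_1", "GPL_A", "s1", "breast tumour"),
        ("GSM_2", "GPL_A", "s2", "normal breast"),
        ("GSM_3", "GPL_B", "s3", "liver"),
    ])
    con.executemany("INSERT INTO gse_gsm VALUES (?, ?)",
                    [("GSE_1", "GSM_1"), ("GSE_1", "GSM_2"), ("GSE_2", "GSM_3")])
    return con

def combined_search(con, keyword, organism=None):
    """Keyword search over several text fields, joined across GSE, GSM and GPL."""
    sql = """
        SELECT DISTINCT gse.gse, gpl.gpl, gsm.gsm
        FROM gse
        JOIN gse_gsm ON gse.gse = gse_gsm.gse
        JOIN gsm ON gse_gsm.gsm = gsm.gsm
        JOIN gpl ON gsm.gpl = gpl.gpl
        WHERE (gse.title LIKE :kw
               OR gse.summary LIKE :kw
               OR gsm.source_name_ch1 LIKE :kw)
    """
    params = {"kw": f"%{keyword}%"}
    if organism is not None:
        # Optional extra field, standing in for a drop-down selection.
        sql += " AND gpl.organism = :org"
        params["org"] = organism
    sql += " ORDER BY gse.gse, gsm.gsm"
    return con.execute(sql, params).fetchall()

if __name__ == "__main__":
    con = make_toy_db()
    for row in combined_search(con, "breast", organism="Homo sapiens"):
        print(row)
```

Bound parameters (`:kw`, `:org`) keep user-supplied search terms out of the SQL text, which matters for any interface that accepts free-text queries.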

Fig. 2. Screen-capture of GEOmetadb online search: combined GSE-GPL-GSM search.

3 CONCLUSIONS

With the continued growth in the volume and complexity of microarray data available via NCBI GEO, it is critical that researchers have efficient, flexible, powerful methods for querying those data. While GEO offers several options for finding microarray data, GEOmetadb provides an alternative set of tools that is more flexible and efficient for both online and programmatic access to GEO metadata. We expect that improved access to GEO metadata will not only enhance researchers’ abilities to find data of interest, but also enable users to create a customized GEO metadata database, e.g. by annotating experiments with a controlled vocabulary and integrating them with other biological data sources.

REFERENCES

1.  Minimum information about a microarray experiment (MIAME)-toward standards for microarray data.
Authors: A Brazma; P Hingamp; J Quackenbush; G Sherlock; P Spellman; C Stoeckert; J Aach; W Ansorge; C A Ball; H C Causton; T Gaasterland; P Glenisson; F C Holstege; I F Kim; V Markowitz; J C Matese; H Parkinson; A Robinson; U Sarkans; S Schulze-Kremer; J Stewart; R Taylor; J Vilo; M Vingron
Journal: Nat Genet; Date: 2001-12

2.  Gene Expression Omnibus: NCBI gene expression and hybridization array data repository.
Authors: Ron Edgar; Michael Domrachev; Alex E Lash
Journal: Nucleic Acids Res; Date: 2002-01-01

3.  GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor.
Authors: Sean Davis; Paul S Meltzer
Journal: Bioinformatics; Date: 2007-05-12

4.  Bioconductor: open software development for computational biology and bioinformatics.
Authors: Robert C Gentleman; Vincent J Carey; Douglas M Bates; Ben Bolstad; Marcel Dettling; Sandrine Dudoit; Byron Ellis; Laurent Gautier; Yongchao Ge; Jeff Gentry; Kurt Hornik; Torsten Hothorn; Wolfgang Huber; Stefano Iacus; Rafael Irizarry; Friedrich Leisch; Cheng Li; Martin Maechler; Anthony J Rossini; Gunther Sawitzki; Colin Smith; Gordon Smyth; Luke Tierney; Jean Y H Yang; Jianhua Zhang
Journal: Genome Biol; Date: 2004-09-15

5.  NCBI GEO: mining tens of millions of expression profiles--database and tools update.
Authors: Tanya Barrett; Dennis B Troup; Stephen E Wilhite; Pierre Ledoux; Dmitry Rudnev; Carlos Evangelista; Irene F Kim; Alexandra Soboleva; Maxim Tomashevsky; Ron Edgar
Journal: Nucleic Acids Res; Date: 2006-11-11

6.  EzArray: a web-based highly automated Affymetrix expression array data management and analysis system.
Authors: Yuerong Zhu; Yuelin Zhu; Wei Xu
Journal: BMC Bioinformatics; Date: 2008-01-24