Zichen Wang1, Alexander Lachmann2, Avi Ma'ayan2.
Abstract
Publicly available gene expression datasets deposited in the Gene Expression Omnibus (GEO) are growing at an accelerating rate. Such datasets hold great value for knowledge discovery, particularly when integrated. Although numerous software platforms and tools have been developed to enable reanalysis and integration of individual GEO datasets, or groups of them, large-scale reuse of those datasets is impeded by minimal requirements for standardized metadata at both the study and sample levels, and for uniform processing of the data across studies. Here, we review methodologies developed to facilitate the systematic curation and processing of publicly available gene expression datasets from GEO. We identify trends for advanced metadata curation and summarize approaches for reprocessing the data within the entire GEO repository.
Keywords: Computational data curation; FAIR principles; GEO; Gene Expression Omnibus; Natural language processing
Year: 2018 PMID: 30594974 PMCID: PMC6381352 DOI: 10.1007/s12551-018-0490-8
Source DB: PubMed Journal: Biophys Rev ISSN: 1867-2450
Fig. 1. The growth of publicly available gene expression datasets and samples from GEO over time. Plots in the top panel show the growth of gene expression datasets from different transcriptomic profiling technologies over time, whereas plots in the bottom panel show the growth of individual samples from those datasets. The plots were made in September 2018; hence, the totals for 2018 cover only part of the year.
Software tools developed for reanalyzing and further annotating GEO datasets
| Tool | Citation | Individual/multiple | Type | Note | Limitations |
|---|---|---|---|---|---|
| GEO2R | (Barrett et al. | Individual | Web | Implements a GUI that generates graphs and an R script | Limited graphical visualizations; only implements DE analysis; limited to microarray data |
| shinyGEO | (Dumas et al. | Individual | Web | R Shiny extension of GEO2R with improved graphics | DE analysis only available for individual genes; limited to microarray data |
| GEOquery | (Davis and Meltzer | Individual | R package | Bridge between GEO and Bioconductor to enable analyses of GEO datasets in various Bioconductor packages | Requires users to be proficient in R and Bioconductor packages; limited to microarray data |
| GEO2Enrichr | (Gundersen et al. | Individual | Browser extension | Identifies DEGs and pipes them to an enrichment analysis tool | Limited to microarray data; limited analysis components |
| BioJupies | (Torre et al. | Individual | Web | Generates interactive Jupyter notebooks from RNA-seq datasets | Limited to RNA-seq data; only allows two-group comparisons |
| ScanGEO | (Koeppen et al. | Multiple | Web | Identifies DEGs across multiple GEO studies matching user-specified criteria | Limited to curated GEO datasets (GDS); only supports DE analysis |
| ImaGEO | (Toro-Domínguez et al. | Multiple | Web | Performs nine types of meta-analysis across multiple GEO studies | Limited to microarray datasets |
| GEOracle | (Djordjevic et al. | Multiple | Web | Uses text mining of the GEO metadata to automatically identify perturbational GEO datasets and associated metadata | Limited to microarray datasets; only performs DE analysis |
Fig. 2. Graphical summary of various curation approaches for further annotating GEO datasets. Metadata and the gene expression data from an example GEO study are shown on the left. Metadata are composed of semi-structured textual annotations supplied by the authors of the dataset at both the study level and the sample level to describe the experimental design of the study and the characteristics of the samples. The goal of further annotating GEO datasets is to generate structured metadata for each study (top right) and each sample (bottom right). Annotations are linked to relevant controlled vocabularies such as ontologies. Three approaches are visualized as arrows: manual curation and automated NLP both attempt to identify and extract structured metadata from the textual descriptions, whereas supervised machine learning approaches infer metadata from the gene expression data.
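To make the idea of turning semi-structured sample annotations into structured metadata more concrete, the following is a minimal rule-based sketch in Python. It is not the method of any tool reviewed here. It assumes that sample characteristics arrive as free-text "key: value" strings, and the lookup tables `KEY_ALIASES` and `SYNONYMS`, as well as the function `normalize_characteristics`, are hypothetical names introduced only for illustration. A real curation pipeline would map values to ontology identifiers rather than to plain strings.

```python
import re

# Hypothetical map from author-chosen field names to canonical field names.
KEY_ALIASES = {
    "gender": "sex",
    "sex": "sex",
    "tissue": "tissue",
    "tissue type": "tissue",
}

# Hypothetical synonym tables standing in for a controlled vocabulary.
SYNONYMS = {
    "sex": {"m": "male", "male": "male", "f": "female", "female": "female"},
    "tissue": {"liver": "liver", "hepatic tissue": "liver"},
}


def normalize_characteristics(lines):
    """Convert free-text 'key: value' strings into a structured dict."""
    structured = {}
    for line in lines:
        key, sep, value = line.partition(":")
        if not sep:
            # No 'key: value' structure, so this simple rule cannot use it.
            continue
        canonical_key = KEY_ALIASES.get(key.strip().lower())
        if canonical_key is None:
            # Unknown field name: skipped rather than guessed.
            continue
        # Lowercase the value and collapse repeated whitespace.
        cleaned = re.sub(r"\s+", " ", value.strip().lower())
        # Fall back to the cleaned text when no controlled term is known.
        structured[canonical_key] = SYNONYMS[canonical_key].get(cleaned, cleaned)
    return structured


example = ["Gender: M", "tissue type: Hepatic  tissue", "batch 3"]
print(normalize_characteristics(example))
```

Hand-written rules like these cover only the spellings that were anticipated, which is one reason the figure also shows manual curation and supervised machine learning as complementary approaches.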