Andrew Paul Hutchins, Ralf Jauch, Mateusz Dyla, Diego Miranda-Saavedra.
Abstract
Genomic datasets and the tools to analyze them have proliferated at an astonishing rate. However, such tools are often poorly integrated with each other: each program typically produces its own custom output in a variety of non-standard file formats. Here we present glbase, a framework that uses a flexible set of descriptors to quickly parse non-binary data files. glbase includes many functions to intersect two lists of data, including operations on genomic interval data, and supports efficient random access to huge genomic data files. Many glbase functions can produce graphical outputs, including scatter plots, heatmaps, boxplots and other common analytical displays of high-throughput data such as RNA-seq, ChIP-seq and microarray expression data. glbase is designed to rapidly bring biological data into a Python-based analytical environment to facilitate analysis and data processing. In summary, glbase is a flexible and multifunctional toolkit that allows the combination and analysis of high-throughput data (especially next-generation sequencing and genome-wide data), and which has been instrumental in the analysis of complex data sets. glbase is freely available at http://bitbucket.org/oaxiom/glbase/.
Keywords: Bioinformatics; ChIP-seq; Genomics; Microarray; Motifs; RNA-seq; Transcription factor
Year: 2014 PMID: 25408880 PMCID: PMC4230833 DOI: 10.1186/2045-9769-3-1
Source DB: PubMed Journal: Cell Regen (Lond) ISSN: 2045-9769
Figure 1. A schematic overview of the functions included in glbase. glbase accepts files in a variety of formats and brings them into a Python environment as ‘genelist’ objects, which behave like a Python list of key:value pairs. Data can be manipulated within glbase using a variety of built-in functions, and subsequently output in specific formats or graphically for the visual interpretation of (combined) datasets.
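The data model described in Figure 1 (a list of key:value pairs produced from a file by a descriptor) can be illustrated in plain Python. The following is only a minimal sketch and does not use or reproduce the glbase API: the `descriptor` layout, the `load_rows` helper, and the toy records are all hypothetical and exist purely for illustration.

```python
import csv
import io

# Hypothetical descriptor: maps a key name to a column index.
# This is an illustrative assumption, not glbase's actual format specification.
descriptor = {"name": 0, "chrom": 1, "start": 2, "end": 3}


def load_rows(handle, descriptor, delimiter="\t", skip_header=True):
    """Parse a delimited text file into a list of dicts, one dict per row."""
    reader = csv.reader(handle, delimiter=delimiter)
    if skip_header:
        next(reader, None)  # discard the header line if present
    rows = []
    for fields in reader:
        if not fields:
            continue  # ignore blank lines
        rows.append({key: fields[col] for key, col in descriptor.items()})
    return rows


# Made-up toy data standing in for a tab-separated annotation file.
example = "name\tchrom\tstart\tend\ngeneA\tchr1\t100\t200\ngeneB\tchr2\t300\t450\n"
rows = load_rows(io.StringIO(example), descriptor)

for row in rows:
    print(row["name"], row["chrom"], row["start"], row["end"])
```

Changing the descriptor, rather than the parsing code, is what lets one loader cope with differently laid-out text files; that is the general idea the sketch is meant to convey.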
Figure 2. Example graphical output from glbase. Code and raw data can be found in the glbase directory (glbase/examples/). (A) Frequency of the STAT3 DNA-binding word (‘TTCnnnGAA’) in a list of STAT3 ChIP-seq binding sites, compared to a randomly selected background from the control ChIP-seq sample. (B) Heatmap of the top 20 and bottom 20 up- and down-regulated transcription factors when macrophages are stimulated with IL-10. (C) Scatter plot of RNA-seq data. (D) Genomic distribution of STAT3 binding in IL-10-stimulated macrophages. (E) Average phastCons evolutionary conservation score around a list of Sox2-Oct4 ChIP-seq binding peaks. (F) Heatmap of p300 recruitment in mouse ES cells for a list of Sox2-Oct4 ChIP-seq binding peaks. Raw data come from the GEO accession GSE31531 [14], the ENCODE project [16], GSE11431 [15] and the phastCons measure of evolutionary conservation [12]. Transcription factor annotation was based on the DNA-binding domain database [17].
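The comparison in panel (A) amounts to counting how often the word ‘TTCnnnGAA’ occurs in peak sequences versus background sequences. The sketch below shows one way this could be done in plain Python with a regular expression; it is not the code used to produce the figure and does not use the glbase API. The helper names and the toy sequences are hypothetical. Because ‘TTCnnnGAA’ is its own reverse complement, scanning a single strand is sufficient for this particular word.

```python
import re


def word_to_regex(word):
    """Convert a DNA word to a regex; only A, C, G, T and n/N (any base) are handled."""
    return "".join("[ACGT]" if ch in "nN" else ch.upper() for ch in word)


def motif_frequency(sequences, word):
    """Return the fraction of sequences containing at least one match to the word."""
    if not sequences:
        return 0.0
    pattern = re.compile(word_to_regex(word))
    hits = sum(1 for seq in sequences if pattern.search(seq.upper()))
    return hits / len(sequences)


# Made-up toy sequences, for illustration only.
peaks = ["ACGTTCCCAGAATG", "GGGGGGGGGG", "TTCTTTGAA"]
background = ["ACGTACGTACGT", "TTCAAGAAT"]

peak_freq = motif_frequency(peaks, "TTCnnnGAA")
background_freq = motif_frequency(background, "TTCnnnGAA")
print(f"peaks: {peak_freq:.2f}, background: {background_freq:.2f}")
```

In a real analysis the background set would be matched to the peaks in size and sequence length, as the caption implies by drawing it from the control ChIP-seq sample.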