
glbase: a framework for combining, analyzing and displaying heterogeneous genomic and high-throughput sequencing data.

Andrew Paul Hutchins¹, Ralf Jauch², Mateusz Dyla³, Diego Miranda-Saavedra⁴.

Abstract

Genomic datasets and the tools to analyze them have proliferated at an astonishing rate. However, such tools are often poorly integrated with each other: each program typically produces its own custom output in a variety of non-standard file formats. Here we present glbase, a framework that uses a flexible set of descriptors that can quickly parse non-binary data files. glbase includes many functions to intersect two lists of data, including operations on genomic interval data and support for the efficient random access to huge genomic data files. Many glbase functions can produce graphical outputs, including scatter plots, heatmaps, boxplots and other common analytical displays of high-throughput data such as RNA-seq, ChIP-seq and microarray expression data. glbase is designed to rapidly bring biological data into a Python-based analytical environment to facilitate analysis and data processing. In summary, glbase is a flexible and multifunctional toolkit that allows the combination and analysis of high-throughput data (especially next-generation sequencing and genome-wide data), and which has been instrumental in the analysis of complex data sets. glbase is freely available at http://bitbucket.org/oaxiom/glbase/.


Keywords: Bioinformatics; ChIP-seq; Genomics; Microarray; Motifs; RNA-seq; Transcription factor

Year: 2014 PMID: 25408880 PMCID: PMC4230833 DOI: 10.1186/2045-9769-3-1

Source DB: PubMed Journal: Cell Regen (Lond) ISSN: 2045-9769

Background

Genome-scale experiments are rapidly becoming a standard addition to the scientist’s toolkit. However, the development of tools to analyze high-throughput data has lagged behind our ability to generate ever larger datasets, and despite some standardization efforts, custom file formats continue to proliferate. The tools currently used to analyze genome-wide data are very diverse and produce a variety of custom outputs that rarely feed directly into other bioinformatics tools without first being converted into standard file formats. A common workaround is to write ad hoc scripts in some combination of UNIX shell, awk, Perl, Python or another programming language to address the problem at hand. However, these scripts are often designed with only a single use in mind, lack a detailed methodology, may be poorly documented or not preserved at all, and are rarely tested for accuracy and consistency.

Efforts have been made to make this process more transparent. Galaxy is a comprehensive web server with a large number of functions for handling genome-scale data [1], but it is aimed primarily at non-programming scientists and requires extensive user interaction; it is therefore difficult to automate and loses the advantages of a programming environment or the UNIX shell. BEDTools [2] and SAMtools [3] deal efficiently with the standardized genome file formats BED and SAM, but do not deal gracefully with non-standard, poorly formatted or incorrectly formatted input files. The Biopython [4] and Bioperl [5] projects similarly attempt to deal with these problems, but their scope is so broad that the analysis of high-throughput sequencing data has been relatively neglected to date. The Bioconductor [6] project for the R language has a massive scope, with multiple tools from multiple developers that can come together to form a potent analysis toolkit. It is well documented and has become one of the major analytical frameworks for genomic analysis. Yet it has some limitations: the R language has a steep learning curve, and deploying a user’s own methods or functions is difficult. One of the original motivations for developing glbase was to convert files into the import format required by R, and it still fulfills this role.

The Genomic Hyperbrowser [7] takes a novel approach to the analysis of genomic data. Built on top of the Galaxy framework, it uses the widespread concept of ‘tracks’ (i.e. collections of genomic features such as genes, exons and epigenetic data). The user defines a putative relationship between two tracks together with a null model, and the Hyperbrowser then tests this relationship. In this way the Hyperbrowser brings a more statistical and mathematical approach to the analysis of genomic data. Although presented primarily as a web server, it also provides a programmatic interface. ArrayPlex [8] provides a framework similar to glbase for the analysis of heterogeneous genomic data; in addition to a graphical interface, it exposes its functionality through the UNIX shell as executable commands. ArrayPlex is mainly focused on retrieving data from publicly accessible web servers. CruzDB [9] is the tool most similar to glbase. Also implemented in Python, it provides a convenient system to extract data, primarily from the UCSC genome browser, process the data in Python and then submit the data to other tools. It does not contain any internal drawing methods, although it should integrate well with Python plotting libraries such as matplotlib, and potentially also with glbase. Tools originally designed for DNA motif discovery, such as HOMER [10] and MEME [11], are also expanding in scope and offer an increasing diversity of genomic analysis methods, which are exposed to the user not only through a web server but also as command-line tools that can be automated.

glbase is a project designed to complement the above tools for the analysis of genomic data. Using the advantages of the Python programming language, glbase aims to translate biological questions directly into Python code. To that end, glbase addresses several problems. Firstly, it acts as an intermediary between tools. Secondly, it provides a relatively compact programming syntax. Thirdly, it incorporates many common analytical methods to integrate data. Finally, it provides tools for the graphical output of data analyses. glbase deals with the problem of incompatible file formats between different tools not by suggesting a top-down standardization of file formats, but by providing a simple means to describe diverse file formats and load them into a Python programming environment. Additionally, glbase facilitates the down-stream processing of the data, as it includes a suite of common analysis tools, such as heatmaps and sequence read pileups. glbase has been designed to interact with other Python tools, such as SciPy for statistics and matplotlib for graphical output, and data can also be exported into other file formats for analysis in further tools or for import into R. In this way glbase acts as the ‘glue’ between up-stream analysis (e.g. the genomic alignment of sequencing reads and ChIP-seq peak discovery) and down-stream analysis (e.g. ChIP-seq peak annotation, combining ChIP-seq/RNA-seq data, and the production of publication-quality figures).

glbase is implemented as a Python module designed to be used non-interactively in short scripts that achieve specific aims. Such scripts leave a permanent record of the user’s steps, documenting the data analysis and making it repeatable. Furthermore, glbase incorporates methods to overlap and annotate genomic intervals (similar to BEDTools [2]), to map common values across two lists (similar to, but more powerful than, the UNIX command ‘join’), to map genomic coordinates to gene annotations, and to extract sequence data from FASTA files. glbase also includes a selection of analysis tools that produce a variety of graphical summaries of data, including heatmaps, scatter plots, pie charts and histograms of genomic and expression data. Finally, glbase features a flexible and efficient SQL implementation for storing genomic-scale data, such as high-throughput sequence reads or phastCons evolutionary scores [12], which allows the efficient random-access retrieval of numerical data or sequence reads from within millions of sequencing tags. Figure 1 gives a schematic overview of the functions available in glbase. glbase is especially suited to the analysis of next-generation sequencing and genome-wide data, particularly ChIP-seq, RNA-seq and microarray expression data.
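
The idea of SQL-backed random access to sequence reads can be illustrated with a minimal sketch using Python's standard sqlite3 module. This is not glbase's own implementation or schema: the table layout, the index, and the function names build_read_db and reads_in_region are assumptions made purely for illustration, and the reads are made-up values.

```python
import sqlite3

def build_read_db(reads):
    """Store (chrom, left, right, strand) reads in an in-memory SQLite table.

    The schema is hypothetical and chosen only for this sketch.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE reads "
        "(chrom TEXT, left_pos INTEGER, right_pos INTEGER, strand TEXT)"
    )
    conn.executemany("INSERT INTO reads VALUES (?, ?, ?, ?)", reads)
    # An index on chromosome and start position lets SQLite narrow the
    # search instead of scanning every read.
    conn.execute("CREATE INDEX idx_reads ON reads (chrom, left_pos)")
    conn.commit()
    return conn

def reads_in_region(conn, chrom, left, right):
    """Return all reads that overlap the closed interval [left, right]."""
    cursor = conn.execute(
        "SELECT chrom, left_pos, right_pos, strand FROM reads "
        "WHERE chrom = ? AND left_pos <= ? AND right_pos >= ? "
        "ORDER BY left_pos",
        (chrom, right, left),
    )
    return cursor.fetchall()

# Made-up reads for demonstration only.
reads = [
    ("chr1", 100, 136, "+"),
    ("chr1", 120, 156, "-"),
    ("chr1", 5000, 5036, "+"),
    ("chr2", 110, 146, "+"),
]
conn = build_read_db(reads)
hits = reads_in_region(conn, "chr1", 130, 200)
print(hits)
```

Counting the reads returned for consecutive windows around a set of binding sites would give the raw material for a pileup or a tag-density heatmap.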

Figure 1

A schematic overview of the functions included in glbase. glbase accepts files in a variety of formats and brings them into a Python environment as ‘genelist’ objects, which behave like a Python list of key:value pairs. Data can be manipulated within glbase using a variety of built-in functions, and subsequently output in specific formats or graphically for the visual interpretation of (combined) datasets.

Results and discussion

Genelists and flexible file format specifiers

glbase is built primarily around objects called ‘genelists’, which are lists of key:value pairs with many associated methods. For example, the output from the MACS peak-discovery tool [13], here in the format of a BED file, can be loaded using two lines of Python. The contents of the genelist can then be interrogated, showing the index and a list of <key>: <value> pairs for each item. The genelist object behaves in a manner similar to a normal Python list: it can be iterated over, and its values extracted, sorted, sliced and searched. In addition, genelists contain many special methods for working on genomic intervals, particularly for intersecting two lists of genomic locations (similar to BEDTools [2]). Unlike BEDTools, these methods do not require the files to be in BED format, only that each item has a correctly formatted ‘loc’ key containing a genomic interval resembling ‘chr1:1000000-1001000’. Genomic intervals can also be systematically modified. Genelists can be intersected by pairs of matching keys, made unique for any key, and manipulated with many other methods. Finally, the resulting genelists can be saved in a variety of file formats, such as custom TSV (tab-separated value) and standard BED files.
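
The genelist concept described above (a list of key:value items, each carrying a ‘loc’ interval that can be modified and intersected) can be sketched in plain Python. The code below deliberately does not use the glbase API: parse_loc, load_bed_lines, expand and intersect are hypothetical helper names, the peak data are invented, and coordinates are copied from the BED columns as-is without any 0-based/1-based conversion.

```python
import re

# Matches intervals written like 'chr1:1000000-1001000'.
LOC_RE = re.compile(r"^(chr\w+):(\d+)-(\d+)$")

def parse_loc(loc):
    """Split a 'chrom:left-right' string into (chrom, left, right)."""
    match = LOC_RE.match(loc)
    if match is None:
        raise ValueError(f"not a genomic interval: {loc!r}")
    chrom, left, right = match.groups()
    return chrom, int(left), int(right)

def load_bed_lines(lines):
    """Turn BED-style rows into a list of dicts, each with a 'loc' key."""
    items = []
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines and header lines
        cols = line.rstrip("\n").split("\t")
        items.append({"loc": f"{cols[0]}:{cols[1]}-{cols[2]}", "name": cols[3]})
    return items

def expand(item, base_pairs):
    """Return a copy of item with its interval widened on both sides."""
    chrom, left, right = parse_loc(item["loc"])
    widened = dict(item)
    widened["loc"] = f"{chrom}:{max(0, left - base_pairs)}-{right + base_pairs}"
    return widened

def overlaps(loc_a, loc_b):
    """True if two intervals share a chromosome and at least one base."""
    chrom_a, left_a, right_a = parse_loc(loc_a)
    chrom_b, left_b, right_b = parse_loc(loc_b)
    return chrom_a == chrom_b and left_a <= right_b and left_b <= right_a

def intersect(list_a, list_b):
    """Items of list_a that overlap any item of list_b (simple O(n*m) scan)."""
    return [a for a in list_a if any(overlaps(a["loc"], b["loc"]) for b in list_b)]

# Invented peaks for demonstration only.
bed = [
    "chr1\t1000\t1200\tpeak_1\n",
    "chr1\t5000\t5300\tpeak_2\n",
    "chr2\t1000\t1200\tpeak_3\n",
]
peaks = load_bed_lines(bed)
wide = [expand(peak, 100) for peak in peaks]
other = [{"loc": "chr1:1250-1400", "name": "site_a"}]

print(intersect(peaks, other))  # no overlap before widening
print(intersect(wide, other))   # peak_1 overlaps once widened by 100 bp
```

Because each item is an ordinary dict, any extra columns from a custom file can ride along with the ‘loc’ key, which is the property that frees the intersection from the BED format.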

Flexible specifiers to describe any arrangement of tabular data

In addition to loading standard file formats, such as BED, SAM, GTF/GFF and FASTA, glbase includes a flexible way to describe any tabular file format (for example tab-separated value [TSV] and comma-separated value [CSV] files). To load such a file, glbase only needs to know the names of the keys and the column numbers in which they appear. For example, a single line of code is enough to describe the full formal definition of a BED file: each entry in the specifier gives a key name and the column number of the TSV file in which to find the data. This flexible format specifier can be used to describe almost any TSV file for loading into glbase.
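
A format specifier of this kind, mapping key names to column numbers, can be sketched with the standard csv module. The dictionary layout, the zero-based column numbering, and the names custom_format and load_table are assumptions for illustration and are not glbase's actual specifier syntax; the table contents are invented.

```python
import csv
import io

# Hypothetical specifier: key name -> zero-based column number.
# Columns not named here (the 'id' column below) are simply ignored.
custom_format = {"name": 1, "log2fc": 2}

def load_table(handle, fmt, delimiter="\t", skip_lines=0):
    """Read a delimited file into a list of dicts described by fmt."""
    rows = []
    reader = csv.reader(handle, delimiter=delimiter)
    for line_number, cols in enumerate(reader):
        if line_number < skip_lines or not cols:
            continue  # skip header lines and blank lines
        rows.append(
            {key: cols[index] for key, index in fmt.items() if index < len(cols)}
        )
    return rows

# Invented CSV content for demonstration only.
text = "id,name,log2fc\n1,geneA,1.8\n2,geneB,-0.7\n"
rows = load_table(io.StringIO(text), custom_format, delimiter=",", skip_lines=1)
print(rows)
```

Values are kept as strings in this sketch; a fuller loader would also need a rule for converting numeric columns.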

Analysis and graphical outputs

In addition to acting as a universal file format converter, a second major utility of glbase is to act as the ‘glue’ between up-stream and down-stream analysis tools, for instance to get from a list of ChIP-seq peaks and gene expression values to heatmaps, gene-peak associations and other informative plots. As an example of usage, glbase includes a tool for finding words in FASTA-formatted DNA sequences. Figure 2A shows the frequency of the STAT3 DNA-binding motif (word) ‘TTCnnnGAA’ in a list of STAT3 ChIP-seq binding data [14]. For any key in a genelist, the frequency of its values can be displayed as a pie chart. glbase can also deal with expression data through the derived genelist-like object ‘expression’, which contains methods for drawing heatmaps (Figure 2B), histograms, boxplots and scatter plots (Figure 2C), and for transforming the expression data (fold-change, log-transform, normalize, etc.). Expression data and ChIP-seq data can be combined to produce density maps of ChIP-seq binding against changes in gene expression or to annotate scatter plots. ChIP-seq data can be compared against any set of genomic annotations, for example gene transcription start sites, to produce a breakdown of distances from the binding site to the transcription start site. Figure 2D shows the distribution of STAT3 binding sites in IL-10 stimulated macrophages relative to the nearest transcription start site [14]. Phylogenetic data (e.g. phastCons scores of evolutionary conservation, although any type of numeric data can be used) can be loaded into an SQL database by glbase and then visualized as pileups (Figure 2E). Similarly, sequence reads can be converted by glbase into an SQL database for efficient retrieval of the reads across arbitrary genomic locations. Figure 2F shows a heatmap of the density of sequence tag reads from a p300 ChIP-seq library centered on a list of Sox2-Oct4 bound regions in embryonic stem cells [15, 16].
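
Counting a DNA word such as ‘TTCnnnGAA’ in FASTA-formatted sequences can be sketched with a regular expression, where each ‘n’ matches any base. This is an illustrative stand-in and not glbase's word-finding tool: word_to_regex, read_fasta and count_word are hypothetical names, the sequences are invented, and only the forward strand is scanned (sufficient for this particular word, which is its own reverse complement).

```python
import re

def word_to_regex(word):
    """Translate a DNA word with 'n' wildcards into a compiled regex."""
    pattern = "".join(
        "[ACGT]" if base in "nN" else re.escape(base.upper()) for base in word
    )
    # A lookahead is used so that overlapping occurrences are all counted.
    return re.compile(f"(?=({pattern}))")

def read_fasta(lines):
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)

def count_word(fasta_lines, word):
    """Return {sequence name: number of forward-strand matches of word}."""
    regex = word_to_regex(word)
    return {name: len(regex.findall(seq)) for name, seq in read_fasta(fasta_lines)}

# Invented sequences for demonstration only.
fasta = [
    ">peak_1", "ACGTTCCCAGAATT",
    ">peak_2", "GGGGCCCCAAAA",
    ">peak_3", "TTCACGGAATTCTTTGAA",
]
counts = count_word(fasta, "TTCnnnGAA")
print(counts)
```

Comparing the fraction of sequences with at least one match between a peak list and a background list gives the kind of frequency comparison summarized in Figure 2A.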

Figure 2

Example graphical output from glbase. Code and raw data can be found in the glbase directory (glbase/examples/). (A) Frequency of the STAT3 DNA-binding word (‘TTCnnnGAA’) in a list of STAT3 ChIP-seq binding sites, compared to a randomly selected background from the control ChIP-seq sample. (B) Heatmap of the top 20 up-regulated and top 20 down-regulated transcription factors when macrophages are stimulated with IL-10. (C) Scatter plot of RNA-seq data. (D) Genomic distribution of STAT3 binding in IL-10 stimulated macrophages. (E) Average phastCons evolutionary conservation score around a list of Sox2-Oct4 ChIP-seq binding peaks. (F) Heatmap of p300 recruitment in mouse ES cells for a list of Sox2-Oct4 ChIP-seq binding peaks. Raw data come from the GEO accessions GSE31531 [14], the ENCODE project [16], GSE11431 [15] and the phastCons measure of evolutionary conservation [12]. Transcription factor annotation was based on the DNA-binding domain database [17].

Conclusions

glbase is a flexible and multifunctional toolkit allowing the user to perform many common analyses on ChIP-seq, microarray and RNA-seq data. Data from distinct sources can be combined inside a unified framework within a Python programming environment for direct analysis, or processed and output for further analysis. glbase has already been used extensively in the analysis of STAT3 binding in macrophages [14], the analysis of STAT3 binding in multiple cell types [18], in analyzing the changes in the transcriptome of stimulated CD4+ T cells [19], and in the analysis of how mutated Sox17 co-operates with Oct4 to specify induced pluripotent stem cells [20, 21]. Thus glbase constitutes a useful addition to the researcher’s toolkit.

Availability and requirements

glbase was developed in Python and uses the freely available Python modules NumPy, SciPy and matplotlib. All functions in glbase are documented in Python (for example, to see the documentation for the map() method of genelists, type: help(glbase.genelist.map)), and documentation is also available as part of the distribution (glbase/docs/build/html/index.html), which also includes seven tutorials, code and example raw data (glbase/examples/) directly aimed at potential users with little or no Python experience. glbase is freely available from http://bitbucket.org/oaxiom/glbase/.

References (10 of 21 listed, numbered as cited in the text)

1. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences.

Authors: Jeremy Goecks; Anton Nekrutenko; James Taylor
Journal: Genome Biol Date: 2010-08-25 Impact factor: 13.583

3. The Sequence Alignment/Map format and SAMtools.

Authors: Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal: Bioinformatics Date: 2009-06-08 Impact factor: 6.937

6. Bioconductor: open software development for computational biology and bioinformatics.

Authors: Robert C Gentleman; Vincent J Carey; Douglas M Bates; Ben Bolstad; Marcel Dettling; Sandrine Dudoit; Byron Ellis; Laurent Gautier; Yongchao Ge; Jeff Gentry; Kurt Hornik; Torsten Hothorn; Wolfgang Huber; Stefano Iacus; Rafael Irizarry; Friedrich Leisch; Cheng Li; Martin Maechler; Anthony J Rossini; Gunther Sawitzki; Colin Smith; Gordon Smyth; Luke Tierney; Jean Y H Yang; Jianhua Zhang
Journal: Genome Biol Date: 2004-09-15 Impact factor: 13.583

12. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes.

Authors: Adam Siepel; Gill Bejerano; Jakob S Pedersen; Angie S Hinrichs; Minmei Hou; Kate Rosenbloom; Hiram Clawson; John Spieth; Ladeana W Hillier; Stephen Richards; George M Weinstock; Richard K Wilson; Richard A Gibbs; W James Kent; Webb Miller; David Haussler
Journal: Genome Res Date: 2005-07-15 Impact factor: 9.043

14. Genome-wide analysis of STAT3 binding in vivo predicts effectors of the anti-inflammatory response in macrophages.

Authors: Andrew Paul Hutchins; Stéphane Poulain; Diego Miranda-Saavedra
Journal: Blood Date: 2012-02-09 Impact factor: 22.113

15. Integration of external signaling pathways with the core transcriptional network in embryonic stem cells.

Authors: Xi Chen; Han Xu; Ping Yuan; Fang Fang; Mikael Huss; Vinsensius B Vega; Eleanor Wong; Yuriy L Orlov; Weiwei Zhang; Jianming Jiang; Yuin-Han Loh; Hock Chuan Yeo; Zhen Xuan Yeo; Vipin Narang; Kunde Ramamoorthy Govindarajan; Bernard Leong; Atif Shahab; Yijun Ruan; Guillaume Bourque; Wing-Kin Sung; Neil D Clarke; Chia-Lin Wei; Huck-Hui Ng
Journal: Cell Date: 2008-06-13 Impact factor: 41.582

16. A user's guide to the encyclopedia of DNA elements (ENCODE).

Authors: ENCODE Project Consortium
Journal: PLoS Biol Date: 2011-04-19 Impact factor: 8.029

17. DBD--taxonomically broad transcription factor predictions: new content and functionality.

Authors: Derek Wilson; Varodom Charoensawan; Sarah K Kummerfeld; Sarah A Teichmann
Journal: Nucleic Acids Res Date: 2007-12-11 Impact factor: 16.971

18. Distinct transcriptional regulatory modules underlie STAT3's cell type-independent and cell type-specific functions.

Authors: Andrew Paul Hutchins; Diego Diez; Yoshiko Takahashi; Shandar Ahmad; Ralf Jauch; Michel Lucien Tremblay; Diego Miranda-Saavedra
Journal: Nucleic Acids Res Date: 2013-01-07 Impact factor: 16.971

19. Discovery and characterization of new transcripts from RNA-seq data in mouse CD4(+) T cells.

Authors: Andrew Paul Hutchins; Stéphane Poulain; Hodaka Fujii; Diego Miranda-Saavedra
Journal: Genomics Date: 2012-08-04 Impact factor: 5.736
