
Relating genes to function: identifying enriched transcription factors using the ENCODE ChIP-Seq significance tool.

Raymond K Auerbach, Bin Chen, Atul J Butte.

Abstract

MOTIVATION: Biological analysis has shifted from identifying genes and transcripts to mapping these genes and transcripts to biological functions. The ENCODE Project has generated hundreds of ChIP-Seq experiments spanning multiple transcription factors and cell lines for public use, but tools for a biomedical scientist to analyze these data are either non-existent or tailored to narrow biological questions. We present the ENCODE ChIP-Seq Significance Tool, a flexible web application leveraging public ENCODE data to identify enriched transcription factors in a gene or transcript list for comparative analyses.

IMPLEMENTATION: The ENCODE ChIP-Seq Significance Tool is written in JavaScript on the client side and has been tested on Google Chrome, Apple Safari and Mozilla Firefox browsers. Server-side scripts are written in PHP and leverage R and a MySQL database. The tool is available at http://encodeqt.stanford.edu.

CONTACT: abutte@stanford.edu

SUPPLEMENTARY INFORMATION: Supplementary material is available at Bioinformatics online.

Substances: Transcription Factors

Year: 2013 PMID: 23732275 PMCID: PMC3712221 DOI: 10.1093/bioinformatics/btt316

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

Identifying gene or transcript signatures associated with development, disease and drug efficacy has been a focal point of biology and medicine since the release of the human genome sequence; however, focus has only recently shifted to relating these signatures to function on a genome-wide scale. Thanks to next-generation sequencing assays such as ChIP-Seq that query an entire genome, transcription factor-binding sites that may explain the underlying biochemical function behind these signatures can be identified (Johnson et al., 2007; Robertson et al., 2007). The ENCODE Consortium, in particular, has spent over US$123 million producing and analyzing functional assays for public use. As part of this effort, ENCODE has generated 843 ChIP-Seq experiments (including histone modifications and control experiments) across >90 cell lines from various tissues, treatments and conditions (ENCODE Project Consortium, 2012). Although ENCODE ChIP-Seq experiments represent an invaluable treasure trove of data to intersect against gene/transcript signatures and identify enriched transcription factors (TFs), the Consortium does not provide a simple tool for this purpose; the only option is to download and parse each results file individually. Other tools, such as Cscan, have tried to fill this void, but their configuration options focus only on transcription start sites (Zambelli et al., 2012). Additionally, Cscan uses the set of all genes from the entire human genome as a background set, rendering it impractical for signatures derived from expression microarrays or from array-capture–based sequencing experiments. We present the ENCODE ChIP-Seq Significance Tool, a simple, flexible single-page web application that addresses these gaps in existing tools, leverages public ChIP-Seq data from the ENCODE Production Phase and gives biomedical researchers the ability to conduct comparative analyses with their lists of gene/transcript signatures.

2 DESCRIPTION

2.1 Underlying database

The ENCODE ChIP-Seq Significance Tool leverages a MySQL database of official, unified peak calls from 708 ENCODE ChIP-Seq non-histone and non-control experiments, encompassing 220 transcription factor and treatment combinations across 91 cell types. We first represent each called peak by the genomic position of its apex to minimize the effect of broader peak shapes biasing our database. The significant peak calls and the apex positions are those officially released by ENCODE for unrestricted public use. Using all protein-coding genes and transcripts as well as pseudogenes identified in the Gencode v15 annotation along with corresponding IDs and symbols from Entrez, Ensembl, HAVANA and the HUGO Gene Nomenclature Committee, we intersected the positions of all ChIP-Seq peak apexes from each experiment against the start and end positions of each gene, identifying the closest peak to the transcription start site (TSS) and transcription termination site (TTS). These values are recorded in the database for each gene/factor/cell line combination. Peak call files targeting the same combination of factor and cell line were pooled provided that the antibody target, cell treatments and binding context were the same. For example, experiments using different antibodies targeting the same carboxy-terminal domain repeat in the large subunit of RNA polymerase II to identify transcription initiation (e.g. Covance MMS-126 R and abcam ab5408) were pooled, but these experiments were kept separate from experiments using the abcam ab5095 antibody targeting the phosphorylated serine-2 of the same repeat to identify a stalled polymerase. Experiments using cells subjected to different treatments were also kept separate (e.g. untreated cells versus cells stimulated with interferon-γ).
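
The nearest-apex lookup described above can be illustrated with a short sketch. This is not the authors' code (the tool itself uses PHP, R and MySQL); it is a minimal Python illustration that assumes peak apexes for one experiment and one chromosome are available as a sorted list of integer positions. The function names `nearest_apex` and `gene_ends` and all coordinates are hypothetical.

```python
from bisect import bisect_left


def nearest_apex(sorted_apexes, position):
    """Return the apex closest to `position`, or None if no apexes exist.

    `sorted_apexes` must be sorted in ascending order. On a tie, the
    lower-coordinate apex is returned.
    """
    if not sorted_apexes:
        return None
    i = bisect_left(sorted_apexes, position)
    # Only the apexes immediately left and right of `position` can be closest.
    candidates = sorted_apexes[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda apex: abs(apex - position))


def gene_ends(start, end, strand):
    """Map annotated start/end coordinates to (TSS, TTS), honoring strand."""
    return (start, end) if strand == "+" else (end, start)


# Hypothetical toy data: four peak apexes and one minus-strand gene.
apexes = sorted([1_200, 4_950, 9_800, 15_300])
tss, tts = gene_ends(5_000, 10_000, "-")
closest_to_tss = nearest_apex(apexes, tss)
closest_to_tts = nearest_apex(apexes, tts)
print(closest_to_tss, closest_to_tts)
```

Storing the closest apex per gene, factor and cell line, as the text describes, means a later window query only needs to compare one stored distance against the chosen window size rather than rescanning every peak.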

2.2 ENCODE ChIP-Seq significance tool

The ENCODE ChIP-Seq Significance Tool is a single-page web application that identifies enriched TFs in gene or transcript lists and presents the separate results from each list in a unified view (Fig. 1). The user begins by defining parameters including the gene/transcript ID system (Ensembl, Entrez, HAVANA or gene symbol from hg19), the feature type to use as the center of the window (TSS/5′-end, TTS/3′-end or the entire gene/transcript body), as well as the upstream and downstream analysis window size in increments of 500 bp. A gene list can be compared with the union of any combination of ChIP-Seq experiments in ENCODE Tier 1, 2 and 3 cell lines that comprise the ENCODE unrestricted dataset. For each unique TF/treatment combination, the ENCODE ChIP-Seq Significance Tool queries our database to identify the number of genes in each gene list with at least one TF peak apex in the selected window around the TSS, TTS or gene body. Enrichment scores are calculated using a one-tailed hypergeometric test followed by multiple hypothesis correction using the FDR method (Benjamini and Hochberg, 1995; Supplementary Material). Enrichment scores and gene counts for each TF across each list are shown in a table that can be saved, printed or copied to the clipboard directly from the tool. We also allow the user to specify a custom background list, allowing our tool to be applied to microarray data and other data not generated from genome- or transcriptome-wide assays.
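
The enrichment calculation can be sketched as follows. The exact parameterization used by the tool is given in the Supplementary Material, not here, so this is an assumed setup: N background genes, K of which have at least one peak apex for a given TF in the window, and a query list of n genes, k of which have such an apex. The function names and all counts are hypothetical, and the tool itself performs these computations in R.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """One-tailed hypergeometric P(X >= k).

    N: background size, K: background genes with a peak,
    n: query list size, k: query genes with a peak.
    """
    upper = min(K, n)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(pvalues)
    # Walk from the largest P-value down, enforcing monotonicity.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for offset, idx in enumerate(order):
        rank = m - offset  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical counts for three TFs tested against one 200-gene list
# drawn from a 20,000-gene background: (k, K) per TF.
counts = {"TF_A": (40, 1500), "TF_B": (18, 1600), "TF_C": (5, 900)}
raw = [hypergeom_upper_tail(k, K, 200, 20_000) for k, K in counts.values()]
for name, q in zip(counts, benjamini_hochberg(raw)):
    print(name, q)
```

Supplying a custom background list corresponds to changing N and K in this sketch, which is why the background choice matters for signatures derived from microarrays.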

Fig. 1. The ENCODE ChIP-Seq significance tool after a query

In addition to identifying enriched TFs, researchers will also want to identify the specific genes that have a binding site for a particular transcription factor. By clicking on any cell in the results table, the user can retrieve a second table listing the gene IDs, associated metadata and links to external resources for each column and row combination. These tables can be copied to the user’s clipboard or saved to a file from within the web application.

2.3 Advantages over existing tools

Our curated database of ENCODE ChIP-Seq peaks coupled with our web interface offers several distinct advantages. First, existing tools tend to limit analyses to transcription start sites, but ENCODE data also include chromatin remodelers, DNA repair proteins and other classes of DNA-binding proteins that exhibit more promiscuous binding patterns. By offering the ability to select whole body or TTS analyses in addition to TSS analyses, we give researchers the flexibility to explore a wider range of biological questions. Existing tools also tend to restrict researchers to a limited range of window sizes near TSSs (e.g. ±1 kb). Our tool allows researchers to select a window size up to 5000 bp (in 500 bp increments) on each side of the TSS, TTS or gene/transcript body. Protein-coding genes/transcripts are also separated from pseudogenes in our underlying annotation. Additionally, we allow researchers to supply a custom background list, extending the application of our tool to signatures derived from methods that do not query the entire genome. Finally, comparative analysis between gene lists is common in many biomedical data analyses, but existing web tools often present comparisons one list at a time. For a researcher with multiple gene lists, this often means undertaking a time-consuming process of processing, saving and collating multiple comparisons manually. The ENCODE ChIP-Seq Significance Tool allows researchers to enter multiple lists in a single query, calculates significance scores for each list, and presents the separate results in a unified table to reduce both analysis time and the potential for human error.
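
To make the window options concrete, here is a minimal sketch of how a strand-aware analysis window around the TSS, TTS or gene body might be derived. It is an illustration only, not the tool's implementation: the function names, the feature labels "TSS", "TTS" and "body", and the choice to accept a window size of 0 are all assumptions.

```python
def analysis_window(start, end, strand, feature, upstream, downstream):
    """Return (left, right) genomic bounds of the analysis window.

    `upstream` and `downstream` are relative to the direction of
    transcription, so they swap sides for minus-strand genes.
    """
    for size in (upstream, downstream):
        # Assumed rule: multiples of 500 bp, at most 5000 bp per side.
        if size < 0 or size > 5000 or size % 500 != 0:
            raise ValueError("window sizes must be 0-5000 bp in 500 bp steps")
    if feature == "TSS":
        left = right = start if strand == "+" else end
    elif feature == "TTS":
        left = right = end if strand == "+" else start
    elif feature == "body":
        left, right = start, end
    else:
        raise ValueError(f"unknown feature type: {feature!r}")
    if strand == "+":
        return left - upstream, right + downstream
    return left - downstream, right + upstream


def has_apex_in_window(apexes, window):
    """True if at least one peak apex falls inside the window (inclusive)."""
    left, right = window
    return any(left <= apex <= right for apex in apexes)


# Hypothetical minus-strand gene with a 1000 bp upstream / 500 bp
# downstream window around its TSS.
window = analysis_window(5_000, 10_000, "-", "TSS", 1_000, 500)
print(window, has_apex_in_window([4_950, 9_800], window))
```

The same function covers all three feature types, which mirrors the point made above: TTS and whole-body analyses differ from TSS analyses only in which coordinates anchor the window.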

3 RESULTS AND DISCUSSION

The ENCODE ChIP-Seq Significance Tool can be applied to a wide range of biological questions. We discuss two examples in the Supplementary Material: identifying the functional context of an unknown sequence motif and uncovering possible mechanisms of action for the drug dexamethasone. In the first example, we find that the unknown sequence motif appears in many promoter regions that are enriched for Polycomb group proteins, indicating genes that are likely repressed or that require strict regulation of chromatin structure. The unknown motif may even represent a binding site for a different Polycomb group protein. The second example shows that glucocorticoid receptor (Gr) is the most enriched TF among genes that are significantly upregulated after treatment with dexamethasone, an agonist of Gr. Most of the observed Gr-binding genes are related to inflammatory and immune disease, demonstrating that dexamethasone may act as an anti-inflammatory and an immunosuppressant mainly by targeting Gr-regulated, disease-related genes. This analysis helps identify the possible Gr-binding genes responsible for the drug action (e.g. the gene NFKBIA). Please see the Supplementary Material for additional information about tool usage and features.

4 in total

1. Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing.

Authors: Gordon Robertson; Martin Hirst; Matthew Bainbridge; Misha Bilenky; Yongjun Zhao; Thomas Zeng; Ghia Euskirchen; Bridget Bernier; Richard Varhol; Allen Delaney; Nina Thiessen; Obi L Griffith; Ann He; Marco Marra; Michael Snyder; Steven Jones
Journal: Nat Methods Date: 2007-06-11 Impact factor: 28.547

2. Genome-wide mapping of in vivo protein-DNA interactions.

Authors: David S Johnson; Ali Mortazavi; Richard M Myers; Barbara Wold
Journal: Science Date: 2007-05-31 Impact factor: 47.728

3. Cscan: finding common regulators of a set of genes by using a collection of genome-wide ChIP-seq datasets.

Authors: Federico Zambelli; Gian Marco Prazzoli; Graziano Pesole; Giulio Pavesi
Journal: Nucleic Acids Res Date: 2012-06-04 Impact factor: 16.971

4. An integrated encyclopedia of DNA elements in the human genome.

Authors: ENCODE Project Consortium
Journal: Nature Date: 2012-09-06 Impact factor: 49.962


41 in total

1. A comprehensive time-course-based multicohort analysis of sepsis and sterile inflammation reveals a robust diagnostic gene set.

Authors: Timothy E Sweeney; Aaditya Shidham; Hector R Wong; Purvesh Khatri
Journal: Sci Transl Med Date: 2015-05-13 Impact factor: 17.956

2. Oncogenic Activation of the RNA Binding Protein NELFE and MYC Signaling in Hepatocellular Carcinoma.

Authors: Hien Dang; Atsushi Takai; Marshonna Forgues; Yotsowat Pomyen; Haiwei Mou; Wen Xue; Debashish Ray; Kevin C H Ha; Quaid D Morris; Timothy R Hughes; Xin Wei Wang
Journal: Cancer Cell Date: 2017-07-10 Impact factor: 31.743

3. The DYT6 Dystonia Protein THAP1 Regulates Myelination within the Oligodendrocyte Lineage.

Authors: Dhananjay Yellajoshyula; Chun-Chi Liang; Samuel S Pappas; Silvia Penati; Angela Yang; Rodan Mecano; Ravindran Kumaran; Stephanie Jou; Mark R Cookson; William T Dauer
Journal: Dev Cell Date: 2017-07-10 Impact factor: 12.270

4. Relationship of DNA methylation and gene expression in idiopathic pulmonary fibrosis.

Authors: Ivana V Yang; Brent S Pedersen; Einat Rabinovich; Corinne E Hennessy; Elizabeth J Davidson; Elissa Murphy; Brenda Juan Guardela; John R Tedrow; Yingze Zhang; Mandal K Singh; Mick Correll; Marvin I Schwarz; Mark Geraci; Frank C Sciurba; John Quackenbush; Avrum Spira; Naftali Kaminski; David A Schwartz
Journal: Am J Respir Crit Care Med Date: 2014-12-01 Impact factor: 21.405

5. Homoharringtonine deregulates MYC transcriptional expression by directly binding NF-κB repressing factor.

Authors: Xin-Jie Chen; Wei-Na Zhang; Bing Chen; Wen-Da Xi; Ying Lu; Jin-Yan Huang; Yue-Ying Wang; Jun Long; Song-Fang Wu; Yun-Xiang Zhang; Shu Wang; Si-Xing Li; Tong Yin; Min Lu; Xiao-Dong Xi; Jun-Min Li; Kan-Kan Wang; Zhu Chen; Sai-Juan Chen
Journal: Proc Natl Acad Sci U S A Date: 2019-01-18 Impact factor: 11.205

6. Epigenomic elements enriched in the promoters of autoimmunity susceptibility genes.

Authors: Mikhail G Dozmorov; Jonathan D Wren; Marta E Alarcón-Riquelme
Journal: Epigenetics Date: 2013-11-08 Impact factor: 4.528

7. Activin/Smad2-induced Histone H3 Lys-27 Trimethylation (H3K27me3) Reduction Is Crucial to Initiate Mesendoderm Differentiation of Human Embryonic Stem Cells.

Authors: Lu Wang; Xuanhao Xu; Yaqiang Cao; Zhongwei Li; Hao Cheng; Gaoyang Zhu; Fuyu Duan; Jie Na; Jing-Dong J Han; Ye-Guang Chen
Journal: J Biol Chem Date: 2016-12-13 Impact factor: 5.157

8. Organ size control is dominant over Rb family inactivation to restrict proliferation in vivo.

Authors: Ursula Ehmer; Anne-Flore Zmoos; Raymond K Auerbach; Dedeepya Vaka; Atul J Butte; Mark A Kay; Julien Sage
Journal: Cell Rep Date: 2014-07-10 Impact factor: 9.423

9. BART: a transcription factor prediction tool with query gene sets or epigenomic profiles.

Authors: Zhenjia Wang; Mete Civelek; Clint L Miller; Nathan C Sheffield; Michael J Guertin; Chongzhi Zang
Journal: Bioinformatics Date: 2018-08-15 Impact factor: 6.937

10. Genome-Wide Mapping and Interrogation of the Nmp4 Antianabolic Bone Axis.

Authors: Paul Childress; Keith R Stayrook; Marta B Alvarez; Zhiping Wang; Yu Shao; Selene Hernandez-Buquer; Justin K Mack; Zachary R Grese; Yongzheng He; Daniel Horan; Fredrick M Pavalko; Stuart J Warden; Alexander G Robling; Feng-Chun Yang; Matthew R Allen; Venkatesh Krishnan; Yunlong Liu; Joseph P Bidwell
Journal: Mol Endocrinol Date: 2015-08-05