Cistrome Data Browser: a data portal for ChIP-Seq and chromatin accessibility data in human and mouse.

Shenglin Mei^1,2, Qian Qin^1,2, Qiu Wu^1,2, Hanfei Sun^2, Rongbin Zheng^2, Chongzhi Zang^3,4, Muyuan Zhu^2, Jiaxin Wu^5, Xiaohui Shi^2, Len Taing^3, Tao Liu^6, Myles Brown^4,7, Clifford A Meyer^8,4, X Shirley Liu^9,2,3,4.

Abstract

Chromatin immunoprecipitation, DNase I hypersensitivity and transposase-accessibility assays combined with high-throughput sequencing enable the genome-wide study of chromatin dynamics, transcription factor binding and gene regulation. Although rapidly accumulating publicly available ChIP-seq, DNase-seq and ATAC-seq data are a valuable resource for the systematic investigation of gene regulation processes, a lack of standardized curation, quality control and analysis procedures has hindered extensive reuse of these data. To overcome this challenge, we built the Cistrome database, a collection of ChIP-seq and chromatin accessibility data (DNase-seq and ATAC-seq) published before January 1, 2016, including 13 366 human and 9953 mouse samples. All the data have been carefully curated, processed with a streamlined analysis pipeline and evaluated with comprehensive quality control metrics. We have also created a user-friendly web server for data query, exploration and visualization. The resulting Cistrome DB (Cistrome Data Browser), available online at http://cistrome.org/db, is expected to become a valuable resource for transcriptional and epigenetic regulation studies.

Year: 2016 PMID: 27789702 PMCID: PMC5210658 DOI: 10.1093/nar/gkw983

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Genome-wide identification of transcription factor (TF) and chromatin regulator binding sites, histone modifications and chromatin accessibility is important for understanding transcriptional control governing biological processes such as differentiation, oncogenesis and cellular response to environmental perturbations (1–3). Massively parallel DNA sequencing combined with chromatin immunoprecipitation (ChIP-seq), DNase I hypersensitivity (DNase-seq) and the transposase-accessible chromatin assay (ATAC-seq) enables the genome-wide study of transcriptional regulation, histone modification and cis-regulatory elements (4–7). Over the past decade, ChIP-seq, DNase-seq and ATAC-seq data have rapidly accumulated in large repositories such as the Gene Expression Omnibus (GEO) (8,9) and the European Nucleotide Archive (ENA) (10). Many experimental biologists may not have the bioinformatics expertise to use these valuable resources effectively. The Encyclopedia of DNA Elements (ENCODE) Consortium has provided high quality processed genome-wide histone modification, chromatin regulator and transcription factor binding data in a selected set of human cell lines (3,11) and the NIH Roadmap Epigenomics Project has built a similar resource for human stem cells and tissues (12). These projects, however, do not support data generated outside the consortia. Other ChIP-seq databases, such as CR Cistrome (13), BloodChIP (14) and CTCFBSDS (15), contain only a limited number of samples, each focusing on a narrow selection of factor or tissue types. Standardized quality control and streamlined analysis of ChIP-seq and chromatin accessibility data have also been lacking.

Most of the publicly available ChIP-seq and chromatin accessibility data are deposited in the GEO (8,9) and ENA (10) repositories. However, owing to inconsistent metadata annotation and the lack of unified data processing procedures, quality control measures and interfaces for visualization and exploration, these valuable resources have been underutilized. We collected about 23 000 samples published before January 1, 2016, covering more than 800 TFs and 80 histone marks, and processed them with a uniform pipeline, producing quality control metrics and analysis results. To make these resources more accessible to the research community, we developed Cistrome DB at http://cistrome.org/db, an integrated data analysis and visualization portal for ChIP-seq and chromatin accessibility data in human and mouse.

DATA BROWSER CONTENTS

All data sources and web-interface features are summarized in Figure 1. Cistrome DB consists of three parts: a curated metadata collection, processed data and a web interface.

Figure 1.

Schematic of Cistrome DB data sources and web-interface features. Cistrome DB collects publicly available ChIP-seq, DNase-seq and ATAC-seq data from the Gene Expression Omnibus (GEO), the Encyclopedia of DNA Elements (ENCODE) and Roadmap Epigenomics. Metadata are manually curated and annotated with PubMed information. All data are processed by a streamlined analysis pipeline and stored in a MySQL relational database. Cistrome DB provides methods to query and visualize data. Users can search by keywords or select by term. Detailed metadata annotations, analysis results and quality control (QC) metrics are presented for each sample. Data can be explored in more detail using the Cistrome analysis pipeline (Cistrome AP) and visualized using the UCSC and WashU genome browsers.

Data sources and metadata annotation

Cistrome DB is a comprehensive annotated resource of publicly available ChIP-seq and chromatin accessibility data in human and mouse. Samples collected include those from the NCBI GEO database (8,9), the ENCODE database (16) and the Roadmap Epigenomics project (12). We have systematically annotated the following metadata for each sample: species, biological sources (cell line/population, cell type, tissue origin, strain, disease state), factor name, PubMed ID and citation. Metadata were automatically parsed from GEO entries, followed by manual curation of factor names and biological sources to ensure annotation consistency. The number of ChIP-seq and chromatin accessibility experiments has increased dramatically since 2007, with 2015 alone contributing 8941 new samples (Figure 2A). In total, our database contains 23 319 ChIP-seq and chromatin accessibility samples, 9953 for mouse and 13 366 for human. Among them, there are 10 276 ChIP-seq samples for transcription factors and chromatin regulators, 10 680 ChIP-seq samples for histone modifications and variants and 1370 chromatin accessibility samples; the remaining 993 are classified as other (Figure 2B). These samples include 713 different TFs and 76 histone modifications/variants in human, and 480 TFs and 71 histone modifications/variants in mouse (Figure 2C).
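The sample counts quoted above can be cross-checked with a few lines of arithmetic. The sketch below only re-adds the figures given in this section; the variable and category names are illustrative and are not identifiers used by Cistrome DB.

```python
# Sample counts quoted in the text; names are illustrative only.
by_species = {"human": 13_366, "mouse": 9_953}
by_category = {
    "tf_and_chromatin_regulator": 10_276,
    "histone_modification_or_variant": 10_680,
    "chromatin_accessibility": 1_370,
    "other": 993,
}
reported_total = 23_319

species_total = sum(by_species.values())
category_total = sum(by_category.values())

# Both breakdowns should add up to the reported database size.
print(species_total == reported_total, category_total == reported_total)

# Percentage of the database contributed by each category.
category_share = {
    name: round(100 * count / reported_total, 1)
    for name, count in by_category.items()
}
print(category_share)
```

Both breakdowns sum to 23 319, so the species and category statistics are mutually consistent.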

Figure 2.

Database content. (A) Growth statistics of ChIP-seq and chromatin accessibility data. (B) Statistics of processed ChIP-seq and chromatin accessibility (CA) data in Cistrome DB. (C) Statistics of transcription factor and histone modification types. (D) Example of quality control metrics in Cistrome DB. (E) Batch sample visualization through the WashU browser showing the co-binding pattern between master transcription factors in embryonic stem cells.

Data processing and quality control metrics

To keep the data consistent, all ChIP-seq, DNase-seq and ATAC-seq samples were processed by ChiLin (17,18), a streamlined pipeline for chromatin profiling data analysis and quality control. Analysis included three steps: read mapping, peak calling and peak annotation. We first downloaded raw ChIP-seq and chromatin accessibility data from GEO or ENA, and mapped them to the human (hg38) or the mouse (mm10) genome using BWA (19). Peak calling was done using MACS2 (20). Finally, we performed annotation analysis, including average conservation profiles across peak regions, motif analysis (21) and putative gene target identification (22). Comprehensive information about the tools and parameters used for data analysis can be found on the ‘About’ page of the Cistrome DB website.

A set of quality control (QC) metrics is an important feature of Cistrome DB. We provide seven different QC criteria across three layers (Figure 2D). In the reads layer, the median quality score is used to evaluate the raw sequencing quality; uniquely mapped reads are used to reflect mapping quality; and the PCR bottleneck coefficient (PBC) (23) is used to identify potential over-amplification by PCR. In the ChIP layer, the FRiP score (23) (fraction of non-mitochondrial reads in peak regions) and the number of high quality peaks with 10- or 20-fold enrichment over background were calculated to show data quality at the ChIP level. QC measures in the annotation layer include the proportions of peaks in promoters, introns and intergenic regions, along with the proportion of peaks overlapping with a union of DNase-seq peaks across diverse cell types. Using our collection of 23 319 samples, we established the thresholds for these QC metrics based on their overall distributions. The Cistrome DB result page displays whether or not each sample meets these thresholds. See the Supplementary Material for details on the definition and calculation of QC statistics. Distributions of quality control statistics and thresholds are shown in Supplementary Figure S1 and on the Cistrome DB website.

Peak calling algorithms are specialized in identifying either narrow or broad enrichment, although there is no precise threshold that distinguishes one category from the other (24). Cistrome DB QC metrics were mostly developed for TFs and histone marks with sharp enrichment; for factors or marks with broad enrichment patterns the current QC measures might not be as reliable. The accuracy of ChIP-seq experiments relies heavily on antibody specificity and quality, and it is common for antibodies to recognize several proteins apart from the stated target. Current Cistrome DB QC metrics do not provide information on this issue. We suggest that Cistrome DB users who are unfamiliar with potential pitfalls in ChIP-seq, DNase-seq or ATAC-seq understand the nature of biases in these data types before using Cistrome DB results in their research (18).
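To make two of the QC metrics concrete, here is a minimal sketch of how a PBC and a FRiP score could be computed from a list of mapped reads. The exact definitions used by Cistrome DB are given in the Supplementary Material, which is not reproduced here, so this sketch assumes the commonly used ENCODE-style definitions: PBC as the number of genomic locations covered by exactly one read divided by the number of distinct locations covered by at least one read, and FRiP as the fraction of non-mitochondrial reads whose position falls inside a peak. The function names, the tuple layouts and the toy data are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Read = Tuple[str, int, str]  # (chromosome, 5' position, strand)
Peak = Tuple[str, int, int]  # (chromosome, start, end), half-open interval


def pbc(reads: List[Read]) -> float:
    """PCR bottleneck coefficient: locations seen once / distinct locations."""
    location_counts = Counter(reads)
    if not location_counts:
        return 0.0
    seen_once = sum(1 for n in location_counts.values() if n == 1)
    return seen_once / len(location_counts)


def frip(reads: List[Read], peaks: List[Peak], mito: str = "chrM") -> float:
    """Fraction of non-mitochondrial reads that fall inside any peak."""
    usable = [read for read in reads if read[0] != mito]
    if not usable:
        return 0.0
    in_peaks = sum(
        1
        for chrom, pos, _strand in usable
        if any(chrom == p_chrom and start <= pos < end
               for p_chrom, start, end in peaks)
    )
    return in_peaks / len(usable)


# Toy data, invented purely for illustration.
toy_reads = [
    ("chr1", 100, "+"), ("chr1", 100, "+"),  # duplicate location
    ("chr1", 150, "+"), ("chr1", 900, "-"),
    ("chr2", 50, "+"), ("chrM", 10, "+"),    # chrM read is ignored by FRiP
]
toy_peaks = [("chr1", 90, 200)]

print(pbc(toy_reads))              # 4 of 5 distinct locations are seen once
print(frip(toy_reads, toy_peaks))  # 3 of 5 non-mitochondrial reads in peaks
```

The linear scan over peaks keeps the sketch short; a real implementation on millions of reads would use an interval index or a dedicated tool.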

Data visualization and extensions

Cistrome DB also provides visualization functions that allow users to view peaks and signal intensity in either the UCSC (25) or WashU (26) genome browsers. Visualization of both single and batch samples is supported. For example, a ‘super-enhancer’ region of the genome contains multiple SOX2 and NANOG binding sites and is enriched in Mediator and H3K27ac signals (Figure 2E). Using Cistrome DB, users can select the relevant ChIP-seq samples and visualize the co-binding pattern between master transcription factors on the WashU genome browser using the sample batch view function (Figure 2E). In addition, Cistrome DB can export data to our previously created Cistrome analysis pipeline (Cistrome AP) (21) for downstream analysis.

DATA RETRIEVAL

Samples

Cistrome DB provides automatically parsed and manually curated metadata annotations for each sample, including species, factor name, biological source, publication and process status, which are stored in a local MySQL relational database. Each ChIP-seq or chromatin accessibility sample has a unique sample identifier. A web interface has been designed to provide user-friendly access and visualization. The result page displays detailed annotations, analysis results and quality control metrics for each sample. To track the source of the original sample, links to citations and the data repository are provided. Users can also download analysis results, send data to the WashU or UCSC genome browsers for visualization, or send data to Cistrome AP for subsequent analysis. In addition, Cistrome DB provides a list of putative target genes of the factor for each sample. Users can view the complete ranked list of putative target genes or search for a gene of interest by gene symbol.

Queries

Cistrome DB offers two options for searching. One is through a selection list and the other is based on an advanced search menu. Users can select a sample of interest from a list of factors and biological sources, or search by factor name, cell type, GSM accession number or other keywords. Each search produces a table of matched samples. Users can then view detailed data annotations and analysis results by clicking on the table entry. Any sample of interest can be added to a batch view list and visualized in the UCSC or WashU genome browsers.
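A keyword search of this kind can be sketched against a small relational table. The text states that the metadata are stored in MySQL; the example below substitutes Python's built-in sqlite3 module so that it is self-contained. The table layout, column names, the `search` helper and the sample rows (including the placeholder GSM accessions) are all assumptions made for illustration, not the actual Cistrome DB schema or data.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory table with a hypothetical metadata layout."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE sample (
               sample_id INTEGER PRIMARY KEY,
               species   TEXT,
               factor    TEXT,
               cell_type TEXT,
               gsm       TEXT
           )"""
    )
    # Placeholder rows; the accessions are not real GEO records.
    rows = [
        ("Homo sapiens", "CTCF", "K562", "GSM0000001"),
        ("Homo sapiens", "H3K27ac", "K562", "GSM0000002"),
        ("Mus musculus", "SOX2", "Embryonic stem cell", "GSM0000003"),
    ]
    conn.executemany(
        "INSERT INTO sample (species, factor, cell_type, gsm) VALUES (?, ?, ?, ?)",
        rows,
    )
    return conn


def search(conn: sqlite3.Connection, keyword: str) -> list:
    """Return samples whose factor, cell type or accession contains the keyword."""
    pattern = f"%{keyword}%"
    cursor = conn.execute(
        """SELECT sample_id, species, factor, cell_type, gsm
           FROM sample
           WHERE factor LIKE ? OR cell_type LIKE ? OR gsm LIKE ?
           ORDER BY sample_id""",
        (pattern, pattern, pattern),
    )
    return cursor.fetchall()


conn = build_demo_db()
for row in search(conn, "K562"):
    print(row)
```

Parameterized queries (the `?` placeholders) keep user-supplied keywords out of the SQL text, which matters for any search box exposed on a public web server.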

Exploring data

Users can start their data exploration either by keyword search or by selecting a species and browsing lists of factors or biological sources. Searching produces a table of matched samples. Cistrome DB makes it easy to query one factor in multiple cell types or multiple factors within a single cell type. Cistrome DB provides four layers of content for each sample. First, it provides a manually curated metadata annotation, including the species, factor name, biological source, citation and data accession number. Second, it presents analysis results, including a peak file, a read density file, motif scan results, putative target genes and summaries of the distribution of peaks across different genomic locus categories. Third, it provides comprehensive QC metrics at the read, peak and annotation levels. Finally, it provides functions to analyze and visualize these samples; users can directly send data to the Cistrome analysis pipeline (Cistrome AP) or load data to the UCSC and WashU genome browsers for visualization. Both single-sample and batch visualization are supported.
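The layered content and the two query directions described above (one factor across cell types, or several factors within one cell type) can be illustrated with a small in-memory model. This is a sketch only: the `SampleRecord` class, its field names, the two grouping helpers and the example records are hypothetical and do not reflect how Cistrome DB stores its data.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SampleRecord:
    # Layer 1: curated metadata
    sample_id: int
    species: str
    factor: str
    biological_source: str
    # Layer 2: analysis results (file names are placeholders)
    peak_file: str = ""
    signal_file: str = ""
    # Layer 3: QC metrics keyed by metric name
    qc: Dict[str, float] = field(default_factory=dict)
    # Layer 4 (analysis and visualization) is an action offered by the
    # web interface rather than stored data, so it has no field here.


def sources_by_factor(records: List[SampleRecord]) -> Dict[str, List[str]]:
    """For each factor, list the biological sources it was profiled in."""
    grouped = defaultdict(set)
    for record in records:
        grouped[record.factor].add(record.biological_source)
    return {factor: sorted(sources) for factor, sources in grouped.items()}


def factors_by_source(records: List[SampleRecord]) -> Dict[str, List[str]]:
    """For each biological source, list the factors profiled in it."""
    grouped = defaultdict(set)
    for record in records:
        grouped[record.biological_source].add(record.factor)
    return {source: sorted(factors) for source, factors in grouped.items()}


# Invented example records.
records = [
    SampleRecord(1, "Homo sapiens", "CTCF", "K562"),
    SampleRecord(2, "Homo sapiens", "CTCF", "HeLa"),
    SampleRecord(3, "Homo sapiens", "H3K27ac", "K562", qc={"frip": 0.12}),
]

print(sources_by_factor(records)["CTCF"])
print(factors_by_source(records)["K562"])
```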

DISCUSSION AND FUTURE DIRECTIONS

We present Cistrome DB, the most comprehensive knowledgebase and data portal for ChIP-seq and chromatin accessibility data in human and mouse. Cistrome DB is a valuable resource for transcriptional and epigenetic regulation studies. Transcriptional regulation is a complex process that is controlled by hundreds of TFs, cofactors and chromatin regulators (27). With the Cistrome DB data, users can systematically investigate patterns of transcription factor or chromatin regulator binding and histone modifications related to their research questions. Chromatin profiles in diverse cell types and under various experimental conditions can be used in tissue-specific or cell developmental studies (28). Cistrome DB provides quality control guidelines for ChIP-seq, DNase-seq and ATAC-seq experiments that allow users to disregard low quality results. Experimentalists can use our historical quality control data to evaluate the quality of their own ChIP-seq or chromatin accessibility experiments. We update Cistrome DB on a regular basis to incorporate newly published ChIP-seq and chromatin accessibility samples. In the future, the utility of Cistrome DB will be improved in several ways. Cell information, QC metrics, TF types and histone modifications can be further classified. In addition, we will integrate Cistrome DB with other data. Cistrome DB is more than a data repository; it also allows users to visualize and explore the data. In comparison with other resources, Cistrome DB is by far the most comprehensive database for curated and analyzed ChIP-seq and chromatin accessibility data.

28 in total

1. High-resolution profiling of histone methylations in the human genome.

Authors: Artem Barski; Suresh Cuddapah; Kairong Cui; Tae-Young Roh; Dustin E Schones; Zhibin Wang; Gang Wei; Iouri Chepelev; Keji Zhao
Journal: Cell Date: 2007-05-18 Impact factor: 41.582

2. The NIH Roadmap Epigenomics Mapping Consortium.

Authors: Bradley E Bernstein; John A Stamatoyannopoulos; Joseph F Costello; Bing Ren; Aleksandar Milosavljevic; Alexander Meissner; Manolis Kellis; Marco A Marra; Arthur L Beaudet; Joseph R Ecker; Peggy J Farnham; Martin Hirst; Eric S Lander; Tarjei S Mikkelsen; James A Thomson
Journal: Nat Biotechnol Date: 2010-10 Impact factor: 54.908

3. Transcriptional regulation and its misregulation in disease.

Authors: Tong Ihn Lee; Richard A Young
Journal: Cell Date: 2013-03-14 Impact factor: 41.582

4. Architecture of the human regulatory network derived from ENCODE data.

Authors: Mark B Gerstein; Anshul Kundaje; Manoj Hariharan; Stephen G Landt; Koon-Kiu Yan; Chao Cheng; Xinmeng Jasmine Mu; Ekta Khurana; Joel Rozowsky; Roger Alexander; Renqiang Min; Pedro Alves; Alexej Abyzov; Nick Addleman; Nitin Bhardwaj; Alan P Boyle; Philip Cayting; Alexandra Charos; David Z Chen; Yong Cheng; Declan Clarke; Catharine Eastman; Ghia Euskirchen; Seth Frietze; Yao Fu; Jason Gertz; Fabian Grubert; Arif Harmanci; Preti Jain; Maya Kasowski; Phil Lacroute; Jing Jane Leng; Jin Lian; Hannah Monahan; Henriette O'Geen; Zhengqing Ouyang; E Christopher Partridge; Dorrelyn Patacsil; Florencia Pauli; Debasish Raha; Lucia Ramirez; Timothy E Reddy; Brian Reed; Minyi Shi; Teri Slifer; Jing Wang; Linfeng Wu; Xinqiong Yang; Kevin Y Yip; Gili Zilberman-Schapira; Serafim Batzoglou; Arend Sidow; Peggy J Farnham; Richard M Myers; Sherman M Weissman; Michael Snyder
Journal: Nature Date: 2012-09-06 Impact factor: 49.962

5. Cistrome: an integrative platform for transcriptional regulation studies.

Authors: Tao Liu; Jorge A Ortiz; Len Taing; Clifford A Meyer; Bernett Lee; Yong Zhang; Hyunjin Shin; Swee S Wong; Jian Ma; Ying Lei; Utz J Pape; Michael Poidinger; Yiwen Chen; Kevin Yeung; Myles Brown; Yaron Turpaz; X Shirley Liu
Journal: Genome Biol Date: 2011-08-22 Impact factor: 13.583

6. CR Cistrome: a ChIP-Seq database for chromatin regulators and histone modification linkages in human and mouse.

Authors: Qixuan Wang; Jinyan Huang; Hanfei Sun; Jing Liu; Juan Wang; Qian Wang; Qian Qin; Shenglin Mei; Chengchen Zhao; Xiaoqin Yang; X Shirley Liu; Yong Zhang
Journal: Nucleic Acids Res Date: 2013-11-18 Impact factor: 16.971

7. ChiLin: a comprehensive ChIP-seq and DNase-seq quality control and analysis pipeline.

Authors: Qian Qin; Shenglin Mei; Qiu Wu; Hanfei Sun; Lewyn Li; Len Taing; Sujun Chen; Fugen Li; Tao Liu; Chongzhi Zang; Han Xu; Yiwen Chen; Clifford A Meyer; Yong Zhang; Myles Brown; Henry W Long; X Shirley Liu
Journal: BMC Bioinformatics Date: 2016-10-03 Impact factor: 3.169

8. The accessible chromatin landscape of the human genome.

Authors: Robert E Thurman; Eric Rynes; Richard Humbert; Jeff Vierstra; Matthew T Maurano; Eric Haugen; Nathan C Sheffield; Andrew B Stergachis; Hao Wang; Benjamin Vernot; Kavita Garg; Sam John; Richard Sandstrom; Daniel Bates; Lisa Boatman; Theresa K Canfield; Morgan Diegel; Douglas Dunn; Abigail K Ebersol; Tristan Frum; Erika Giste; Audra K Johnson; Ericka M Johnson; Tanya Kutyavin; Bryan Lajoie; Bum-Kyu Lee; Kristen Lee; Darin London; Dimitra Lotakis; Shane Neph; Fidencio Neri; Eric D Nguyen; Hongzhu Qu; Alex P Reynolds; Vaughn Roach; Alexias Safi; Minerva E Sanchez; Amartya Sanyal; Anthony Shafer; Jeremy M Simon; Lingyun Song; Shinny Vong; Molly Weaver; Yongqi Yan; Zhancheng Zhang; Zhuzhu Zhang; Boris Lenhard; Muneesh Tewari; Michael O Dorschner; R Scott Hansen; Patrick A Navas; George Stamatoyannopoulos; Vishwanath R Iyer; Jason D Lieb; Shamil R Sunyaev; Joshua M Akey; Peter J Sabo; Rajinder Kaul; Terrence S Furey; Job Dekker; Gregory E Crawford; John A Stamatoyannopoulos
Journal: Nature Date: 2012-09-06 Impact factor: 49.962

9. A quality control system for profiles obtained by ChIP sequencing.

Authors: Marco-Antonio Mendoza-Parra; Wouter Van Gool; Mohamed Ashick Mohamed Saleem; Danilo Guillermo Ceschin; Hinrich Gronemeyer
Journal: Nucleic Acids Res Date: 2013-09-14 Impact factor: 16.971

10. Biocuration of functional annotation at the European nucleotide archive.

Authors: Richard Gibson; Blaise Alako; Clara Amid; Ana Cerdeño-Tárraga; Iain Cleland; Neil Goodgame; Petra Ten Hoopen; Suran Jayathilaka; Simon Kay; Rasko Leinonen; Xin Liu; Swapna Pallreddy; Nima Pakseresht; Jeena Rajan; Marc Rosselló; Nicole Silvester; Dmitriy Smirnov; Ana Luisa Toribio; Daniel Vaughan; Vadim Zalunin; Guy Cochrane
Journal: Nucleic Acids Res Date: 2015-11-28 Impact factor: 16.971
