Genome-wide map of human and mouse transcription factor binding sites aggregated from ChIP-Seq data.

Ilya E Vorontsov, Alla D Fedorova, Ivan S Yevshin, Ruslan N Sharipov, Fedor A Kolpakov, Vsevolod J Makeev, Ivan V Kulakovskiy.

Abstract

OBJECTIVES: Mammalian genomics studies, especially those focusing on transcriptional regulation, require information on genomic locations of regulatory regions, particularly, transcription factor (TF) binding sites. There are plenty of published ChIP-Seq data on in vivo binding of transcription factors in different cell types and conditions. However, handling of thousands of separate data sets is often impractical and it is desirable to have a single global map of genomic regions potentially bound by a particular TF in any of studied cell types and conditions. DATA DESCRIPTION: Here we report human and mouse cistromes, the maps of genomic regions that are routinely identified as TF binding sites, organized by TF. We provide cistromes for 349 mouse and 599 human TFs. Given a TF, its cistrome regions are supported by evidence from several ChIP-Seq experiments or several computational tools, and, as an optional filter, contain occurrences of sequence motifs recognized by the TF. Using the cistrome, we provide an annotation of TF binding sites in the vicinity of human and mouse transcription start sites. This information is useful for selecting potential gene targets of transcription factors and detecting co-regulated genes in differential gene expression data.

Keywords:  ChIP-Seq; Cistrome; Human and mouse; Regulatory regions; Target genes; Transcription factor binding sites

Year:  2018        PMID: 30352610      PMCID: PMC6199713          DOI: 10.1186/s13104-018-3856-x

Source DB:  PubMed          Journal:  BMC Res Notes        ISSN: 1756-0500


Objective

Locations of genomic regions responsible for transcriptional regulation are valuable for many genomics and genetics studies, from analysis of gene regulatory networks to prediction of the functional impact of non-coding genetic variants. There are thousands of experimental data sets on in vivo binding of human and mouse transcription factors (TFs), with chromatin immunoprecipitation followed by deep sequencing (ChIP-Seq) as the gold standard method. Many existing databases, such as GTRD [1] or ReMap [2], focus on systematic reprocessing of published ChIP-Seq data and user-friendly access to it. By design, ChIP-Seq data provide information on cell-type-specific binding. Binding profiles of the same TF can differ considerably between cell types or experimental conditions, and for a particular TF it is not always feasible to analyze hundreds of individual data sets separately. Instead, a list of reproducible regions routinely identified as TF binding sites could be valuable for preliminary selection of putative target genes or for the discovery of key regulators of differentially expressed genes. For example, the meta-clusters provided in GTRD are genome segments bound by the studied TF across several data sets. For practical applications, it is useful to have such binding regions separated into reproducibility categories and annotated with occurrences of TF binding motifs, to highlight likely genuine binding sites of each particular TF. TF binding sites in the vicinity of transcription start sites are of special interest because they allow putative TF target genes to be identified. Finally, it would be convenient to have the genomic coordinates of a TF binding map available for each of several commonly used genome releases.

Data description

Here we present the human and mouse cistromes [3], the genome-wide maps of regions bound by TFs, obtained through systematic analysis of ChIP-Seq data. The cistromes include data for 349 mouse and 599 human TFs. Cistromes provide an important information layer for detecting putative target genes of the corresponding TFs, for detecting regulators bound to known promoters and enhancers, and for intersection and enrichment analysis of various genomic features, including regulatory sequence variants. For each TF, the cistrome consists of sets of non-overlapping regions with assigned reliability categories. For convenience, we provide genome-wide (Table 1, Data sets 1–4) and gene-centric (Table 1, Data sets 5–8) maps for two major human (hg19, hg38) and mouse (mm9, mm10) genome assemblies [4, 5]. The genome-wide map contains global genomic coordinates of TF binding regions. The gene-centric map contains, for each transcription start site (TSS), the relative locations of the nearest cistrome segments.
Table 1

Overview of data files/data sets

| Label | Name of data file/data set | File types (file extension) | Data repository and identifier (DOI or accession number) |
| --- | --- | --- | --- |
| Data set 1 | cistrome_hg19.zip | Archive file (.zip) containing genomic regions files (.bed) | Figshare (10.6084/m9.figshare.7087697) |
| Data set 2 | cistrome_hg38.zip | Archive file (.zip) containing genomic regions files (.bed) | Figshare (10.6084/m9.figshare.7087697) |
| Data set 3 | cistrome_mm9.zip | Archive file (.zip) containing genomic regions files (.bed) | Figshare (10.6084/m9.figshare.7087697) |
| Data set 4 | cistrome_mm10.zip | Archive file (.zip) containing genomic regions files (.bed) | Figshare (10.6084/m9.figshare.7087697) |
| Data set 5 | cistrome2genes_hg19.zip | Archive file (.zip) containing tab-separated files (.tsv) | Figshare (10.6084/m9.figshare.7087697) |
| Data set 6 | cistrome2genes_hg38.zip | Archive file (.zip) containing tab-separated files (.tsv) | Figshare (10.6084/m9.figshare.7087697) |
| Data set 7 | cistrome2genes_mm9.zip | Archive file (.zip) containing tab-separated files (.tsv) | Figshare (10.6084/m9.figshare.7087697) |
| Data set 8 | cistrome2genes_mm10.zip | Archive file (.zip) containing tab-separated files (.tsv) | Figshare (10.6084/m9.figshare.7087697) |
| Data file 1 | cistrome_overview.xlsx | MS Excel file (.xlsx) | Figshare (10.6084/m9.figshare.7087697) |
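
The genome-wide archives listed in Table 1 contain BED files. The following is a minimal sketch of reading such regions, assuming only the three mandatory BED columns (chromosome, 0-based start, exclusive end). The helper names `read_bed` and `covered_bp_per_chrom` and the example content are hypothetical; the published files may carry additional columns, which this sketch simply ignores.

```python
import io
from collections import defaultdict


def read_bed(handle):
    """Yield (chrom, start, end) tuples from BED text, skipping blanks and header lines."""
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        # BED coordinates are 0-based and half-open: [start, end)
        yield fields[0], int(fields[1]), int(fields[2])


def covered_bp_per_chrom(regions):
    """Sum region lengths per chromosome (valid as coverage when regions do not overlap)."""
    totals = defaultdict(int)
    for chrom, start, end in regions:
        totals[chrom] += end - start
    return dict(totals)


# Hypothetical example content, not taken from the published archives.
example = "chr1\t100\t250\nchr1\t400\t500\nchr2\t10\t60\n"
regions = list(read_bed(io.StringIO(example)))
print(covered_bp_per_chrom(regions))
```

Because the cistrome regions of a TF are described as non-overlapping, summing the lengths gives the number of covered base pairs without double counting.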

Cistrome aggregation and motif annotation

The initial set of TF binding regions (the ChIP-Seq peaks) was extracted from GTRD (release of 17 April 2017). GTRD provided ChIP-Seq peak calls from four different peak callers (see Data file 1 for details), each executed with default parameters. Using the approach described in [6], some data sets were excluded as unreliable. We then applied BEDTools 2.26.0 [7] to merge the overlapping intervals from different experiments and peak callers. The resulting regions were classified into four reliability categories: A (the highest reliability, experimental and technical reproducibility): regions supported by at least two experiments and at least two peak callers; B (high reliability, experimental reproducibility): regions supported by at least two experiments; C (medium reliability, technical reproducibility): regions supported by at least two peak callers; D (limited reliability): all other reproducible segments. Each segment of the A, B, and C sub-cistromes is additionally required to overlap at least one ChIP-Seq peak from a data set accompanied by experimental control data. The technical details of the cistrome construction and the overall statistics of the cistrome are provided in Data file 1 (see Table 1).

All cistrome categories were annotated with TF binding site predictions based on HOCOMOCO v11 [6] sequence motifs, to obtain a subset of regions containing likely genuine binding sites recognized by a particular TF. Data for the human hg38 and mouse mm10 genome assemblies were produced directly from GTRD peak calls; data for the human hg19 and mouse mm9 assemblies were produced with liftOver (v353) [8]. The A cistrome (the best constitutively bound sites) and the joint ABC cistrome (the compromise) are the most informative.
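
The merging and classification rules above can be illustrated with a short sketch. It is not the authors' pipeline: the interval merge is written in plain Python rather than with BEDTools, the names `Peak`, `merge_peaks`, and `classify`, the experiment and caller labels, and the coordinates are all hypothetical, and two points are assumptions about the text: book-ended intervals are merged (as `bedtools merge` does by default), and regions supported by a single experiment and a single peak caller are treated as not reproducible and left unclassified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peak:
    chrom: str
    start: int         # 0-based, inclusive
    end: int           # exclusive
    experiment: str    # hypothetical experiment identifier
    caller: str        # hypothetical peak caller label
    has_control: bool  # True if the data set had experimental control data


def merge_peaks(peaks):
    """Merge overlapping (or book-ended) peaks, remembering the supporting peaks."""
    merged = []
    for peak in sorted(peaks, key=lambda p: (p.chrom, p.start, p.end)):
        last = merged[-1] if merged else None
        if last and last["chrom"] == peak.chrom and peak.start <= last["end"]:
            last["end"] = max(last["end"], peak.end)
            last["peaks"].append(peak)
        else:
            merged.append({"chrom": peak.chrom, "start": peak.start,
                           "end": peak.end, "peaks": [peak]})
    return merged


def classify(region):
    """Return 'A', 'B', 'C', 'D', or None (assumed: not reproducible, so not reported)."""
    experiments = {p.experiment for p in region["peaks"]}
    callers = {p.caller for p in region["peaks"]}
    controlled = any(p.has_control for p in region["peaks"])
    multi_exp = len(experiments) >= 2
    multi_caller = len(callers) >= 2
    if controlled and multi_exp and multi_caller:
        return "A"
    if controlled and multi_exp:
        return "B"
    if controlled and multi_caller:
        return "C"
    if multi_exp or multi_caller:
        return "D"
    return None


# Hypothetical peaks for one TF.
peaks = [
    Peak("chr1", 100, 200, "exp1", "caller_1", True),
    Peak("chr1", 150, 260, "exp2", "caller_2", False),
    Peak("chr1", 500, 600, "exp1", "caller_1", True),
    Peak("chr1", 520, 610, "exp1", "caller_2", True),
    Peak("chr2", 10, 50, "exp3", "caller_1", False),
    Peak("chr2", 30, 80, "exp4", "caller_1", False),
    Peak("chr2", 300, 350, "exp3", "caller_1", False),
    Peak("chr3", 5, 30, "exp1", "caller_1", True),
    Peak("chr3", 20, 40, "exp2", "caller_1", True),
]
results = [(r["chrom"], r["start"], r["end"], classify(r)) for r in merge_peaks(peaks)]
for row in results:
    print(row)
```

In this toy input, the first chr1 region has two experiments, two callers, and a controlled data set (A), the second has one experiment but two callers (C), the first chr2 region has two experiments without any control (D), and the chr3 region has two controlled experiments from one caller (B).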
We used the A and ABC cistromes, along with the motif annotation, to construct a gene-centric map of TF binding (Table 1, Data sets 5–8) using the GENCODE [9] annotation (GTF, main annotation files) and PyRanges 0.0.13 [10]. For each TSS, the gene-centric map contains the absolute distance from the TSS to the nearest cistrome segment corresponding to binding of a particular TF.
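
The distance computation behind the gene-centric map can be sketched as follows. The source used PyRanges; this sketch swaps in a plain binary search from the Python standard library to stay self-contained. The function name `nearest_segment_distance`, the coordinates, and the distance convention (0-based half-open segments, distance 0 when the TSS lies inside a segment) are assumptions, not details taken from the published files.

```python
import bisect


def nearest_segment_distance(tss, segments):
    """Absolute distance from a TSS to the nearest segment on the same chromosome.

    `segments` must be sorted, non-overlapping (start, end) pairs,
    0-based and half-open. Returns None if there are no segments.
    """
    if not segments:
        return None
    starts = [start for start, _ in segments]
    # Index of the first segment that starts strictly after the TSS.
    i = bisect.bisect_right(starts, tss)
    candidates = []
    if i > 0:
        start, end = segments[i - 1]
        # Segment starting at or before the TSS: inside it, or to its right.
        candidates.append(0 if tss < end else tss - (end - 1))
    if i < len(segments):
        # Segment starting after the TSS.
        candidates.append(segments[i][0] - tss)
    return min(candidates)


# Hypothetical cistrome segments of one TF on one chromosome.
segments = [(100, 200), (500, 650)]
for tss in (50, 150, 260, 700):
    print(tss, nearest_segment_distance(tss, segments))
```

Since the segments do not overlap, only the two neighbors of the TSS in sorted order need to be checked. For many TSSs on the same chromosome, the `starts` list would be built once rather than on every call.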

Limitations

The cistrome lacks metadata regarding cell types, antibodies, or experimental conditions. For studies of particular genes or particular binding sites, the user is advised to consult a detailed database, such as GTRD. The cistrome coverage and reliability depend heavily on the volume of experimental data available for a particular TF. In the presented map, many TFs have very sparse maps with only a few bound regions or only low-reliability cistrome categories, or have no cistrome regions at all. For many TFs it was not possible to perform the motif annotation because reliable information on binding sequence preferences is absent; the corresponding entries are explicitly marked in the gene-centric map.
References (8 in total)

1.  Initial sequencing and analysis of the human genome.

Authors:  E S Lander; L M Linton; B Birren; C Nusbaum; M C Zody; J Baldwin; K Devon; K Dewar; M Doyle; W FitzHugh; R Funke; D Gage; K Harris; A Heaford; J Howland; L Kann; J Lehoczky; R LeVine; P McEwan; K McKernan; J Meldrim; J P Mesirov; C Miranda; W Morris; J Naylor; C Raymond; M Rosetti; R Santos; A Sheridan; C Sougnez; Y Stange-Thomann; N Stojanovic; A Subramanian; D Wyman; J Rogers; J Sulston; R Ainscough; S Beck; D Bentley; J Burton; C Clee; N Carter; A Coulson; R Deadman; P Deloukas; A Dunham; I Dunham; R Durbin; L French; D Grafham; S Gregory; T Hubbard; S Humphray; A Hunt; M Jones; C Lloyd; A McMurray; L Matthews; S Mercer; S Milne; J C Mullikin; A Mungall; R Plumb; M Ross; R Shownkeen; S Sims; R H Waterston; R K Wilson; L W Hillier; J D McPherson; M A Marra; E R Mardis; L A Fulton; A T Chinwalla; K H Pepin; W R Gish; S L Chissoe; M C Wendl; K D Delehaunty; T L Miner; A Delehaunty; J B Kramer; L L Cook; R S Fulton; D L Johnson; P J Minx; S W Clifton; T Hawkins; E Branscomb; P Predki; P Richardson; S Wenning; T Slezak; N Doggett; J F Cheng; A Olsen; S Lucas; C Elkin; E Uberbacher; M Frazier; R A Gibbs; D M Muzny; S E Scherer; J B Bouck; E J Sodergren; K C Worley; C M Rives; J H Gorrell; M L Metzker; S L Naylor; R S Kucherlapati; D L Nelson; G M Weinstock; Y Sakaki; A Fujiyama; M Hattori; T Yada; A Toyoda; T Itoh; C Kawagoe; H Watanabe; Y Totoki; T Taylor; J Weissenbach; R Heilig; W Saurin; F Artiguenave; P Brottier; T Bruls; E Pelletier; C Robert; P Wincker; D R Smith; L Doucette-Stamm; M Rubenfield; K Weinstock; H M Lee; J Dubois; A Rosenthal; M Platzer; G Nyakatura; S Taudien; A Rump; H Yang; J Yu; J Wang; G Huang; J Gu; L Hood; L Rowen; A Madan; S Qin; R W Davis; N A Federspiel; A P Abola; M J Proctor; R M Myers; J Schmutz; M Dickson; J Grimwood; D R Cox; M V Olson; R Kaul; C Raymond; N Shimizu; K Kawasaki; S Minoshima; G A Evans; M Athanasiou; R Schultz; B A Roe; F Chen; H Pan; J Ramser; H Lehrach; R Reinhardt; W R McCombie; M de la Bastide; N Dedhia; 
H Blöcker; K Hornischer; G Nordsiek; R Agarwala; L Aravind; J A Bailey; A Bateman; S Batzoglou; E Birney; P Bork; D G Brown; C B Burge; L Cerutti; H C Chen; D Church; M Clamp; R R Copley; T Doerks; S R Eddy; E E Eichler; T S Furey; J Galagan; J G Gilbert; C Harmon; Y Hayashizaki; D Haussler; H Hermjakob; K Hokamp; W Jang; L S Johnson; T A Jones; S Kasif; A Kaspryzk; S Kennedy; W J Kent; P Kitts; E V Koonin; I Korf; D Kulp; D Lancet; T M Lowe; A McLysaght; T Mikkelsen; J V Moran; N Mulder; V J Pollara; C P Ponting; G Schuler; J Schultz; G Slater; A F Smit; E Stupka; J Szustakowki; D Thierry-Mieg; J Thierry-Mieg; L Wagner; J Wallis; R Wheeler; A Williams; Y I Wolf; K H Wolfe; S P Yang; R F Yeh; F Collins; M S Guyer; J Peterson; A Felsenfeld; K A Wetterstrand; A Patrinos; M J Morgan; P de Jong; J J Catanese; K Osoegawa; H Shizuya; S Choi; Y J Chen; J Szustakowki
Journal:  Nature       Date:  2001-02-15       Impact factor: 49.962

2.  HOCOMOCO: towards a complete collection of transcription factor binding models for human and mouse via large-scale ChIP-Seq analysis.

Authors:  Ivan V Kulakovskiy; Ilya E Vorontsov; Ivan S Yevshin; Ruslan N Sharipov; Alla D Fedorova; Eugene I Rumynskiy; Yulia A Medvedeva; Arturo Magana-Mora; Vladimir B Bajic; Dmitry A Papatsenko; Fedor A Kolpakov; Vsevolod J Makeev
Journal:  Nucleic Acids Res       Date:  2018-01-04       Impact factor: 16.971

3.  BEDTools: a flexible suite of utilities for comparing genomic features.

Authors:  Aaron R Quinlan; Ira M Hall
Journal:  Bioinformatics       Date:  2010-01-28       Impact factor: 6.937

4.  GENCODE: the reference human genome annotation for The ENCODE Project.

Authors:  Jennifer Harrow; Adam Frankish; Jose M Gonzalez; Electra Tapanari; Mark Diekhans; Felix Kokocinski; Bronwen L Aken; Daniel Barrell; Amonida Zadissa; Stephen Searle; If Barnes; Alexandra Bignell; Veronika Boychenko; Toby Hunt; Mike Kay; Gaurab Mukherjee; Jeena Rajan; Gloria Despacio-Reyes; Gary Saunders; Charles Steward; Rachel Harte; Michael Lin; Cédric Howald; Andrea Tanzer; Thomas Derrien; Jacqueline Chrast; Nathalie Walters; Suganthi Balasubramanian; Baikang Pei; Michael Tress; Jose Manuel Rodriguez; Iakes Ezkurdia; Jeltje van Baren; Michael Brent; David Haussler; Manolis Kellis; Alfonso Valencia; Alexandre Reymond; Mark Gerstein; Roderic Guigó; Tim J Hubbard
Journal:  Genome Res       Date:  2012-09       Impact factor: 9.043

5.  Modernizing reference genome assemblies.

Authors:  Deanna M Church; Valerie A Schneider; Tina Graves; Katherine Auger; Fiona Cunningham; Nathan Bouk; Hsiu-Chuan Chen; Richa Agarwala; William M McLaren; Graham R S Ritchie; Derek Albracht; Milinn Kremitzki; Susan Rock; Holland Kotkiewicz; Colin Kremitzki; Aye Wollam; Lee Trani; Lucinda Fulton; Robert Fulton; Lucy Matthews; Siobhan Whitehead; Will Chow; James Torrance; Matthew Dunn; Glenn Harden; Glen Threadgold; Jonathan Wood; Joanna Collins; Paul Heath; Guy Griffiths; Sarah Pelan; Darren Grafham; Evan E Eichler; George Weinstock; Elaine R Mardis; Richard K Wilson; Kerstin Howe; Paul Flicek; Tim Hubbard
Journal:  PLoS Biol       Date:  2011-07-05       Impact factor: 8.029

6.  The UCSC Genome Browser Database: update 2006.

Authors:  A S Hinrichs; D Karolchik; R Baertsch; G P Barber; G Bejerano; H Clawson; M Diekhans; T S Furey; R A Harte; F Hsu; J Hillman-Jackson; R M Kuhn; J S Pedersen; A Pohl; B J Raney; K R Rosenbloom; A Siepel; K E Smith; C W Sugnet; A Sultan-Qurraie; D J Thomas; H Trumbower; R J Weber; M Weirauch; A S Zweig; D Haussler; W J Kent
Journal:  Nucleic Acids Res       Date:  2006-01-01       Impact factor: 16.971

7.  GTRD: a database of transcription factor binding sites identified by ChIP-seq experiments.

Authors:  Ivan Yevshin; Ruslan Sharipov; Tagir Valeev; Alexander Kel; Fedor Kolpakov
Journal:  Nucleic Acids Res       Date:  2016-10-24       Impact factor: 16.971

8.  ReMap 2018: an updated atlas of regulatory regions from an integrative analysis of DNA-binding ChIP-seq experiments.

Authors:  Jeanne Chèneby; Marius Gheorghe; Marie Artufel; Anthony Mathelier; Benoit Ballester
Journal:  Nucleic Acids Res       Date:  2018-01-04       Impact factor: 16.971

