Identification of C2H2-ZF binding preferences from ChIP-seq data using RCADE.

Hamed S Najafabadi¹, Mihai Albu¹, Timothy R Hughes².

Abstract

Current methods for motif discovery from chromatin immunoprecipitation followed by sequencing (ChIP-seq) data often identify non-targeted transcription factor (TF) motifs, and are even further limited when peak sequences are similar due to common ancestry rather than common binding factors. The latter aspect particularly affects a large number of proteins from the Cys2His2 zinc finger (C2H2-ZF) class of TFs, as their binding sites are often dominated by endogenous retroelements that have highly similar sequences. Here, we present recognition code-assisted discovery of regulatory elements (RCADE) for motif discovery from C2H2-ZF ChIP-seq data. RCADE combines predictions from a DNA recognition code of C2H2-ZFs with ChIP-seq data to identify models that represent the genuine DNA binding preferences of C2H2-ZF proteins. We show that RCADE is able to identify generalizable binding models even from peaks that are exclusively located within the repeat regions of the genome, where state-of-the-art motif finding approaches largely fail.
AVAILABILITY AND IMPLEMENTATION: RCADE is available as a webserver and also for download at http://rcade.ccbr.utoronto.ca/.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
CONTACT: t.hughes@utoronto.ca.

Year: 2015 PMID: 25953800 PMCID: PMC4547615 DOI: 10.1093/bioinformatics/btv284

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is the most widely used method for mapping the genomic regions that are associated with transcription factors (TFs) (ENCODE Project Consortium, 2012). Identification of direct TF binding sites from ChIP-seq data is an essential step for decoding the molecular mechanisms that underlie the regulatory programs dictated by TFs, and understanding how genetic changes can affect these programs. In the absence of orthogonal information on DNA binding preferences of TFs, such as in vitro binding data, achieving this goal primarily depends on inference of a binding model (such as a DNA ‘motif’) from the ChIP-seq data. Current approaches for motif finding from ChIP-seq data almost exclusively rely on the assumption that the genomic regions associated with a particular TF have diverse sequences except at the sites that are directly bound by the TF, where the sequences converge to match the TF binding preference. However, this assumption is violated in many cases, such as when the ChIP-seq peaks are dominated by binding sites of the interacting partners of the TF of interest, represent targets of multiple cooperative regulatory factors, and/or are enriched for repetitive DNA sequences such as endogenous retroelements (EREs). Binding to EREs with similar sequences particularly impairs the ability of motif finding approaches to identify the DNA binding preferences of the Cys2His2 zinc finger (C2H2-ZF) class of TFs, which is by far the largest class of TFs in most vertebrates. The C2H2-ZF proteins constitute almost half of all human TFs, and almost half of these proteins bind primarily to EREs (Najafabadi et al., 2015). As a result, the motifs identified from the genomic regions that they bind often reflect the sequence homology among different instances of the associated ERE type, rather than the genuine binding preference of the C2H2-ZF protein.
An alternative is to directly predict the binding preferences of C2H2-ZF TFs from their protein sequences. However, these predictions are often inaccurate (Gupta et al., 2014; Najafabadi et al., 2015; Persikov and Singh, 2014). In addition, not all of the C2H2-ZF domains within a protein participate in DNA binding at the same time, further complicating the task of predicting DNA preference from protein sequence. To address these issues, we present recognition code-assisted discovery of regulatory elements (RCADE), which combines predictions from a recent recognition code of C2H2-ZFs (Najafabadi et al., 2015) with motif optimization based on ChIP-seq data to overcome limitations associated with current approaches, and also to identify the regions of the C2H2-ZF protein that engage in DNA binding.

2 Methods

RCADE examines the C2H2-ZF domains within a protein to identify stretches of adjacent zinc fingers, or zinc finger ‘arrays’, whose predicted binding sites (Najafabadi et al., 2015) are enriched in ChIP-seq peaks relative to dinucleotide-shuffled sequences, indicating direct DNA binding. Then, RCADE optimizes the motifs to discriminate between the real and shuffled sequences (Fig. 1A). Briefly, for each predicted seed motif, RCADE identifies the sequences with the largest motif scores, and constructs a new Position Weight Matrix (PWM) by aligning the motif hits in these sequences, repeating this procedure until the PWM converges. The top-scoring optimized PWM is reported, along with the zinc fingers that are predicted to contribute to DNA binding. The optimized motifs are almost always significantly similar to the original seed motifs, indicating that the optimization procedure does not depart drastically from the starting point. The RCADE algorithm is shown in more detail in Supplementary Figure S1.
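The iterative step described above (score sequences with the current PWM, keep the best-scoring sequences, rebuild the PWM from their aligned hits, repeat until convergence) can be pictured with the minimal sketch below. It is an illustration only, not the RCADE implementation: the function names, the `top_n` and `tol` parameters, the uniform background, the pseudocount and the forward-strand-only scan are all assumptions made here for brevity, and the input is assumed to be uppercase A/C/G/T sequences at least as long as the motif.

```python
import math

BASES = "ACGT"

def build_pwm(hits, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM (list of base -> score dicts) from aligned, equal-length hits."""
    pwm = []
    for pos in range(len(hits[0])):
        counts = {b: pseudocount for b in BASES}
        for hit in hits:
            counts[hit[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2(counts[b] / total / background) for b in BASES})
    return pwm

def best_hit(pwm, seq):
    """Return (score, window) for the highest-scoring forward-strand window of seq."""
    w = len(pwm)
    return max(
        (sum(col[b] for col, b in zip(pwm, seq[i:i + w])), seq[i:i + w])
        for i in range(len(seq) - w + 1)
    )

def refine(seed_pwm, seqs, top_n=3, max_iter=20, tol=1e-6):
    """Rebuild the PWM from the best hits of the top_n sequences until it stops changing."""
    pwm = seed_pwm
    for _ in range(max_iter):
        hits = sorted((best_hit(pwm, s) for s in seqs), reverse=True)[:top_n]
        new_pwm = build_pwm([window for _, window in hits])
        delta = max(abs(new_pwm[i][b] - pwm[i][b])
                    for i in range(len(pwm)) for b in BASES)
        pwm = new_pwm
        if delta < tol:  # hypothetical convergence criterion
            break
    return pwm

def consensus(pwm):
    """Most favoured base at each PWM position."""
    return "".join(max(col, key=col.get) for col in pwm)

# Toy usage: a weak seed for ACGT and four made-up "peak" sequences.
seed = build_pwm(["ACGT"])
peaks = ["TTACGTTT", "GGACGTGG", "CCACGTCC", "TTTTTTTT"]
refined = refine(seed, peaks)
print(consensus(refined))
```

In this toy run the three sequences containing the seed site dominate the rebuilt matrix, so the refined PWM stays close to the seed, which mirrors the observation that optimized motifs rarely depart far from their starting point.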

Fig. 1.

RCADE workflow and benchmarking results. (A) RCADE starts by predicting a set of motifs from the target C2H2-ZF protein sequence, using a previously published bacterial-one-hybrid assay-based recognition code, or B1H-RC (Najafabadi et al., 2015). The predicted motifs are evaluated against the ChIP-seq peak sequences to identify significantly enriched motifs, which are then iteratively optimized. (B) Benchmarking workflow for evaluation of RCADE. The peak sequences were divided into two sets of ERE-overlapping and non-ERE peaks. The ERE-overlapping peaks for each protein were used for motif discovery using RCADE, and the motifs were validated using non-ERE peaks. (C,D) Validation results for 18 ERE-binding proteins. The arrows show the improvement in the AUROC of RCADE motifs compared with seed B1H-RC motifs. (E) Example motifs for two proteins that show the largest difference between RCADE and MEME validation results. The top-scoring MEME motif is shown for each protein, followed by the top-scoring motif that is directly predicted from protein sequence using the B1H-RC, and the RCADE optimized motif. The Pearson similarity of the B1H-RC and RCADE motifs was calculated as described previously (Najafabadi et al., 2015).

3 Benchmarking

To evaluate the performance of RCADE in identifying correct motifs from highly similar sequences, we applied it to the set of ChIP-seq data for all 18 human proteins shown to bind to EREs in a previous study (Najafabadi et al., 2015). We identified the 500 most enriched ERE-overlapping peak summits as well as the 500 most enriched non-ERE peak summits for each dataset, and trained the RCADE motifs exclusively on the ERE-overlapping peaks. Since EREs have highly similar sequences due to common ancestry, it is very difficult to distinguish the correct TF motifs from unrelated enriched sequences, and therefore, most current motif finding approaches are expected to perform poorly. We used MEME (Bailey and Elkan, 1994) for comparison, as it is one of the most widely used motif finding methods. The motifs that were trained on the ERE-overlapping peaks were evaluated using non-ERE peaks (Fig. 1B), to confirm that RCADE does not overfit the motifs to the EREs. Non-ERE sequences are not expected to be similar due to common ancestry, and therefore, motif enrichment in these sequences is indicative of biological relevance. The RCADE motifs generally showed considerably better enrichment at the center of the non-ERE peaks compared with MEME motifs, as evaluated by CentriMo (Bailey and Machanick, 2012) (Fig. 1C). Furthermore, many RCADE motifs are significantly better than MEME motifs at distinguishing non-ERE peaks from dinucleotide-shuffled sequences (Fig. 1D). Two prominent examples of such motifs are shown in Figure 1E. Further validation results are shown in Supplementary Figures S3–S7. We note that in addition to its utility for motif derivation, RCADE pinpoints the C2H2-ZF domains that engage DNA. While RCADE currently supports only the C2H2-ZF class of TFs, its concept can also be applied to other TF classes as long as a suitable recognition code exists.
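The discrimination between held-out peaks and shuffled controls is summarized in Figure 1 as an AUROC. As a rough illustration of that metric, the sketch below computes the AUROC as the probability that a randomly chosen peak sequence receives a higher motif score than a randomly chosen control, with ties counted as one half. How each sequence is scored (for example, by its best PWM hit) is not specified here and is left to the caller; the function name and the toy scores are hypothetical.

```python
def auroc(pos_scores, neg_scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    pos_scores: motif scores of real peak sequences (e.g. non-ERE peaks).
    neg_scores: motif scores of control sequences (e.g. shuffled peaks).
    Ties contribute 0.5, so uninformative scores give 0.5 overall.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Toy usage with made-up scores: three of the four pairs are ranked correctly.
example = auroc([0.9, 0.4], [0.5, 0.1])
print(example)  # 0.75
```

The pairwise loop is quadratic, which is acceptable for sets of about 500 peaks and 500 controls; a rank-based formulation would be preferable for much larger sets.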

Funding

This work was supported by grants from the Canadian Institutes of Health Research (MOP-77721 and MOP-111007), and funding from the Canadian Institute for Advanced Research to T.R.H. H.S.N. was supported by a Canadian Institutes of Health Research Banting Fellowship.

Conflict of Interest: none declared.

References

1. C2H2 zinc finger proteins greatly expand the human regulatory lexicon.

Authors: Hamed S Najafabadi; Sanie Mnaimneh; Frank W Schmitges; Michael Garton; Kathy N Lam; Ally Yang; Mihai Albu; Matthew T Weirauch; Ernest Radovani; Philip M Kim; Jack Greenblatt; Brendan J Frey; Timothy R Hughes
Journal: Nat Biotechnol Date: 2015-02-18 Impact factor: 54.908

2. Fitting a mixture model by expectation maximization to discover motifs in biopolymers.

Authors: T L Bailey; C Elkan
Journal: Proc Int Conf Intell Syst Mol Biol Date: 1994

3. De novo prediction of DNA-binding specificities for Cys2His2 zinc finger proteins.

Authors: Anton V Persikov; Mona Singh
Journal: Nucleic Acids Res Date: 2013-10-03 Impact factor: 16.971

4. Inferring direct DNA binding from ChIP-seq.

Authors: Timothy L Bailey; Philip Machanick
Journal: Nucleic Acids Res Date: 2012-05-18 Impact factor: 16.971

5. An integrated encyclopedia of DNA elements in the human genome.

Authors: ENCODE Project Consortium
Journal: Nature Date: 2012-09-06 Impact factor: 49.962

6. An improved predictive recognition model for Cys(2)-His(2) zinc finger proteins.

Authors: Ankit Gupta; Ryan G Christensen; Heather A Bell; Mathew Goodwin; Ronak Y Patel; Manishi Pandey; Metewo Selase Enuameh; Amy L Rayla; Cong Zhu; Stacey Thibodeau-Beganny; Michael H Brodsky; J Keith Joung; Scot A Wolfe; Gary D Stormo
Journal: Nucleic Acids Res Date: 2014-02-12 Impact factor: 16.971
