
Protein-DNA binding: complexities and multi-protein codes.

Abstract

Binding of proteins to particular DNA sites across the genome is a primary determinant of specificity in genome maintenance and gene regulation. DNA-binding specificity is encoded at multiple levels, from the detailed biophysical interactions between proteins and DNA, to the assembly of multi-protein complexes. At each level, variation in the mechanisms used to achieve specificity has led to difficulties in constructing and applying simple models of DNA binding. We review the complexities in protein-DNA binding found at multiple levels and discuss how they confound the idea of simple recognition codes. We discuss the impact of new high-throughput technologies for the characterization of protein-DNA binding, and how these technologies are uncovering new complexities in protein-DNA recognition. Finally, we review the concept of multi-protein recognition codes in which new DNA-binding specificities are achieved by the assembly of multi-protein complexes.

Substances:
DNA-Binding Proteins
DNA

Year: 2013 PMID: 24243859 PMCID: PMC3936734 DOI: 10.1093/nar/gkt1112

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

In the mid 1970s, motivated by early X-ray crystal structures of proteins and DNA, Seeman et al. (1) proposed a protein–DNA recognition code based on hydrogen bonding patterns between amino acids and bases. For example, in the DNA major groove, an arginine side-chain can make two hydrogen bonds with a guanine base, but not with any other base. Therefore, an Arg-Gua residue base pairing provides a mechanism—or code—for proteins to preferentially select for guanines. Although aesthetically pleasing, 30 years of subsequent analyses of protein–DNA complexes and interactions have demonstrated myriad complications with such a simple recognition code. We briefly summarize findings from structural and biochemical studies that explain why simple protein–DNA recognition codes do not exist.

Side-chain flexibility and water molecules

Protein side-chains and water molecules in the binding interface create a flexible network of interactions that can readily adopt new conformations in response to changes in DNA or protein sequence (2–5). Two examples illustrate the relevant complexities. Comparing protein–DNA crystal structures for wild-type and mutant zinc finger protein Zif268/Egr1, it was demonstrated that a single amino acid mutation (Zif268 D20A) could lead to altered protein side-chain conformations and DNA-binding preferences (6). Further, the side-chain re-arrangement occurred without a change in the protein docking geometry, but resulted in a new, positioned water molecule in the binding interface for one of the DNA sequences. Structural comparison of two Cre recombinase variants in complex with different DNA sequences revealed that both DNA and protein differences affect the contacts made in the binding interface (7). In complex with the same LoxM7 DNA sequence, two Cre variants, which differ at only three residue positions, had distinctly different conformations for the common side-chains mediating base contacts. Further, comparison of one of the variants bound to a different LoxP DNA site revealed that recognition of the two DNA sequences, LoxM7 and LoxP, led to altered side-chain conformations and different side-chain and water-mediated contacts. These studies highlight the complex interplay between side-chains, water molecules and DNA bases in the binding interface.

DNA structure and indirect readout

‘Direct readout’ (also known as ‘base readout’) refers to the situation where proteins discriminate different bases in a DNA sequence via direct (or water-mediated) interactions with the DNA bases. In contrast, ‘indirect readout’ (also known as ‘shape readout’) describes a different mechanism of recognition in which the sequence-dependent deformability or structural differences between DNA molecules contribute to their discrimination (8). A recent review categorized different types of shape readout and distinguished between ‘local’ shape readout, in which deviations from B-form DNA are localized along the sequence, and ‘global’ shape readout, in which much of the DNA molecule is deformed or bent (2). In local shape readout, sequence-dependent kinks in the DNA molecule or localized deviations in DNA groove width can lead to preferential binding by the cognate protein. Examples of widely used types of local shape readout are the preferential binding of arginine and histidine residues into narrow DNA minor grooves (9,10) or binding to DNA kinks observed at YpR steps, such as seen for TATA-box binding protein (9,11,12). In global shape readout, larger deformations are involved. For example, a mechanism proposed for DNA binding of the human papillomavirus E2 protein is that DNA sequences in the unbound form adopt global conformations similar to those found in the bound protein–DNA complex (13). A particular DNA sequence that was already ‘pre-bent’ would require less energy to deform and hence be preferentially bound by a protein (2,13). What complicates prediction of the effects of indirect readout on the binding specificity is that DNA deformation is a structural property that involves multiple, distributed interactions (residue–base and base–base) and is sensitive to the conformation adopted in the bound complex. Even local shape readout (e.g., the local narrowing of the DNA minor groove) involves integrating the individual structural propensities over multiple bases (2). Therefore, a recognition code for many proteins will need to include or account for the detailed structure of the bound protein–DNA complex.

Docking geometry and spatial relationships

Energetically favorable residue–base interactions, particularly hydrogen bond interactions, depend on the relative spatial orientation of the protein residue and the DNA base (1,14–16). Subsequently, the docking geometry of a protein–DNA complex, defined by the orientation of the protein backbone relative to the DNA, is critical to establishing favorable amino acid–base interactions (14,15,17). Comparisons of the docking geometry for different protein–DNA complexes have revealed complications to a simple recognition code. First, the docking geometry of different protein folds can differ dramatically; therefore, favorable interactions made by different residue types will depend strongly on the protein fold (i.e. not all arginines will be able to make optimal contacts with a guanine). Second, even structurally homologous proteins can dock on the DNA in different orientations (14,15,17,18). Therefore, the distributed protein–DNA interactions that contribute to the overall docking geometry, both base-specific and non-specific (i.e., with the DNA backbone), can all potentially affect the protein–DNA interactions. In other words, effects are non-local and residues distributed throughout the protein will affect DNA-binding specificity. Studies have also shown that the docking geometry of the same protein can vary when bound to different DNA sequences (14,18,19). In an illustrative case, crystal structures of the nuclear factor kappa B (NF-κB)-family RelA homodimer in complex with different DNA sequences exhibited markedly different orientations for the dimer subunits. In the structure of RelA homodimer bound to the pseudo-symmetric sequence 5′-GGAA(A)TTTC-3′, one subunit makes a canonical set of hydrogen bond interactions with the 5′-GGAA half-site sequence. In contrast, the other subunit undergoes a large rotation and translation and makes no base-specific contacts with the apposing 5′-GAAA half-site, yet allows the protein to bind well to DNA (20). 
This contrasts starkly with the structure of RelA bound to the more symmetric 5′-GGAA(T)TTCC-3′ site, where the docking and DNA-base contacts are symmetric for the two subunits (21). Flexibility of the protein docking geometry in response to different DNA sequences presents a major difficulty in predicting the interactions and subsequent DNA-binding specificity of a protein (17,22).

The complexities described here that result from flexible protein–DNA interactions, indirect DNA readout and protein docking geometry make it difficult to establish simple recognition rules that can provide a comprehensive description of protein–DNA binding. Structural and biochemical studies have now identified the major specificity-determining residues for many different protein structural families and subfamilies, providing a detailed picture of the basic mechanisms of DNA recognition used by different proteins (2,3). However, despite these structural and mechanistic descriptions, simple recognition codes have not emerged that can accurately characterize the binding affinities of a protein to the vast number of potential DNA-binding site sequences, and this gap has spurred the development of more complex models (23–26).

In the last 10 years, newly developed high-throughput (HT) technologies have revolutionized our ability to characterize protein–DNA binding specificity. Such technologies include: bacterial one-hybrid (B1H) (27,28), protein-binding microarrays (PBMs) (29–33), total internal reflectance fluorescence-PBM (34), mechanically induced trapping of molecular interactions (MITOMI) (35), Bind-n-seq (36), EMSA-seq (37), HT-SELEX/SELEX-seq (38–40), microarray evaluation of genomic aptamers by shift (MEGAshift) (41), cognate site identifier (CSI) (42), and HT sequencing fluorescent ligand interaction profiling (HiTS-FLIP) (43). The rich datasets provided by these new HT technologies are facilitating the development of improved models of protein–DNA recognition (44,45), while at the same time revealing new complications for models of binding. Here we will briefly review the most widely used models and outline features of protein–DNA binding that complicate their development and application.

COMPLEXITIES IN PROTEIN–DNA RECOGNITION

Consensus sequences in which base preferences are represented using letters (e.g., A = Ade, Y = Cyt or Thy) are widely used to represent protein–DNA binding specificity. These models are appropriate for high-specificity DNA-interacting proteins such as restriction enzymes. For example, the enzyme HincII will cut DNA sites that match the consensus GTYRAC (R = Ade or Gua), but its affinity for all other DNA sites is orders of magnitude lower (46). However, transcription factor (TF)–DNA binding is much more degenerate and is often characterized by a gradual transition from high to low affinity sites (24,29,35,43,46–48). To account for the sequence degeneracy in TF binding, the concept of position weight matrices (PWMs) was proposed and remains the most widely used representation of TF-binding specificity (46,49). A PWM is a matrix of scores (or weights) for each DNA base pair along a binding site. PWM models can be learned from a variety of data types, from small collections of known binding sites to large datasets generated using HT technologies. PWMs also have the benefit of being easy to visualize as DNA logos, providing an intuitive feel for the TF-binding specificity (50). One drawback of the PWM formalism is that it makes the implicit assumption that individual base pairs within a binding site contribute independently to the protein–DNA binding affinity. It has been shown that this assumption does not always hold (51–53), and more complex models of protein–DNA specificity have been developed to account for position dependencies within protein–DNA binding sites (52,54–56). Some studies have focused on extending traditional PWM models into ‘higher-order’ models by including contributions from dinucleotides and trinucleotides (45,57–62). A drawback of including non-independent contributions is that the number of model parameters increases significantly, which makes the models harder to learn and prone to overfitting. Thus, selecting the relevant dinucleotides and trinucleotides is critical for building higher-order models that take into account dependencies between positions in TF-binding sites (58,60). We note that other approaches have also been proposed to model DNA-binding specificity; however, a full survey of these various approaches is outside the scope of this review. Beyond inherent sequence degeneracy and positional dependencies, several other aspects of protein–DNA binding complicate the development and application of binding models and are reviewed below.
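The contrast between an all-or-none consensus and a graded PWM can be illustrated with a minimal Python sketch. It is not taken from any of the cited studies: the function names, the pseudocount, the uniform background frequency and the five aligned example sites are all assumptions made for illustration, and the sites are invented rather than measured. The PWM here is a simple log-odds matrix, and the score of a site is the sum of per-position weights, which is exactly the independence assumption discussed above.

```python
import math
import re

BASES = "ACGT"
# Subset of the IUPAC nucleotide codes, sufficient for the consensus used here.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "N": "[ACGT]"}


def matches_consensus(consensus, site):
    """All-or-none test: True only if every position of site fits the consensus."""
    pattern = "".join(IUPAC[code] for code in consensus)
    return re.fullmatch(pattern, site) is not None


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM (one dict per position) from equal-length aligned sites."""
    pwm = []
    for pos in range(len(sites[0])):
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({base: math.log2(counts[base] / total / background)
                    for base in BASES})
    return pwm


def pwm_score(pwm, site):
    """Sum the per-position weights; positions contribute independently."""
    return sum(column[base] for column, base in zip(pwm, site))


# Hypothetical aligned binding sites, invented for illustration only.
EXAMPLE_SITES = ["TGACTCA", "TGACTCA", "TGAGTCA", "TGACTCT", "TGACGCA"]
EXAMPLE_PWM = build_pwm(EXAMPLE_SITES)
```

With these example sites, the consensus test returns only yes or no, whereas the PWM ranks a site matching every column above a site with one less-favored base, and both above an unrelated sequence. A higher-order model would add weights for pairs or triples of positions on top of this sum, at the cost of many more parameters.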

Low-affinity binding sites

Regardless of how degeneracy is represented in protein–DNA binding models, it is critically important to be able to capture the full breadth of binding sites used in vivo. Complicating this goal is the fact that TFs can specifically utilize low-affinity DNA-binding sites to regulate genes. TF binding to low-affinity DNA sites can provide a mechanism for interpreting both spatial (63–66) and temporal (67,68) TF gradients that often arise during development to control where and when genes are expressed. Analysis of genome-wide binding data has also provided evidence that low-affinity sites are under widespread evolutionary selection (69,70) and that their inclusion can greatly improve quantitative models of TF binding and gene regulation used for predicting segmentation patterns during early embryonic development in Drosophila (71). Utilization of sites selected to be lower affinity than an optimal sequence opens the door for functionally relevant sites to deviate strongly from the consensus sequence, and such sites may not be well represented by a particular binding model. For example, a comprehensive analysis of DNA binding by NF-κB dimers identified numerous lower affinity, non-traditional sites that differ significantly from the consensus sites and are not captured by the widely used PWMs (37,72). Disagreement with a PWM model may be due to: (i) a protein having multiple binding modes, which will require multiple PWMs (discussed more below) or (ii) poor or biased parameterization of the PWM model. PWMs can capture low-affinity binding sites but must be explicitly parameterized using low-affinity binding data (59). In either case, by focusing only on the highest affinity sites, protein–DNA binding models are likely to miss functionally important interactions that occur through lower affinity sites; however, making binding models too flexible or encompassing, without proper parameterization, runs the risk of increasing the rate of false-positive predictions.

TF-specific preferences

Another complexity highlighted by recent HT protein–DNA binding studies is that closely related TFs often exhibit both common and TF-specific binding preferences (23,24,35,37,40,72–79). In a study of 104 mouse TFs, it was found that even proteins with a high degree of similarity in their DNA-binding domains (DBDs) (as high as 67% amino acid sequence identity) can exhibit distinct DNA-binding profiles (24). In many cases, the highest affinity site is identical for several members of a DBD family, but individual proteins within the family have different preferences for lower affinity sites. For example, mouse TFs Irf4 and Irf5 share a strong preference for sequences containing the 5′-CGAAAC-3′ site, but prefer, albeit with lower affinity, 5′-TGAAAG-3′ or 5′-CGAGAC-3′, respectively. Homeodomain proteins have been extensively studied using HT approaches (23,77,78), revealing rich and diverse sequence preferences for members even within protein subfamilies. For example, the Lhx subfamily binds with high affinity to sites containing the core 5′-TAATTA-3′, but different subfamily members have different preferences for medium- and low-affinity sites: Lhx2 prefers 5′-TAATGA-3′ and 5′-TAACGA-3′, whereas Lhx4 prefers 5′-TAACGA-3′ and 5′-TAATCT-3′. These TF-specific preferences cause difficulties for DNA-binding models, as the models must accurately characterize the common binding sites while at the same time capturing the sites specific to each TF.

Flanking DNA

Even in the case of TFs with well-defined core DNA binding sites, complexities can still arise from the DNA sequence flanking the core, which can affect binding affinity. Two such examples have been highlighted in recent HT studies (45). The TF Gcn4 binds to the core motif 5′-TGACTCA-3′. Binding measurements of Gcn4 to thousands of sequences containing this core motif revealed a wide range of binding affinities, from no appreciable binding to high-affinity binding (43). Nucleotides immediately adjacent to the core 7-mer were important for binding affinity, but even the two nucleotides flanking the best 9-mer affected affinity by an order of magnitude. Another recent study examined the DNA-binding preferences of the paralogous yeast TFs Cbf1 and Tye7 to hundreds of genomic sequences containing the E-box motif 5′-CACGTG-3′ (45). The study revealed a strong dependence on flanking DNA and computational analyses suggested that the sequence beyond the canonical E-box motif contributes to binding specificity by influencing the three-dimensional structure of the DNA-binding site. Importantly, the additional specificity coming from sequences flanking the core motif could not be described by simply extending the core (45), which is not surprising given that the influence of flanking DNA is likely to be exerted through DNA shape and not through specific DNA contacts. DNA shape is a function of the DNA sequence; however, this relationship is complex (80) and cannot be captured by simple PWM models. Thus, in order to account for the influence of flanking regions on the affinity of DNA-binding sites, future models of binding specificity will likely use DNA shape characteristics explicitly.
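One simple way to examine flanking effects in data of this kind is to group measurements by the bases immediately adjacent to the core motif. The Python sketch below is an illustration only, not the analysis used in the cited studies. The function name and the `(sequence, value)` input format are assumptions, and the binding values in any usage of it would be supplied by the experimenter.

```python
from collections import defaultdict


def group_by_flanks(measurements, core, flank=1):
    """Group measured values by the bases flanking the first core occurrence.

    measurements: iterable of (sequence, value) pairs (hypothetical format).
    Returns {(left_flank, right_flank): [values, ...]}.
    Sequences without the core, or without full flanks on both sides, are skipped.
    """
    groups = defaultdict(list)
    for seq, value in measurements:
        start = seq.find(core)          # -1 if the core is absent
        end = start + len(core)
        if start < flank or end + flank > len(seq):
            continue
        left = seq[start - flank:start]
        right = seq[end:end + flank]
        groups[(left, right)].append(value)
    return dict(groups)
```

Grouping by sequence alone will not capture shape-mediated effects directly, which is the point made above: flank identity is at best a proxy for the DNA shape features that appear to carry the additional specificity.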

Multiple modes of DNA binding

Complicating a simple model of DNA binding for many proteins is that they can bind DNA using two or more distinct modes (Figure 1). These different modes of interaction can lead to fundamentally different DNA-binding preferences (i.e., motifs), complicating simple binding models that do not explicitly account for these multiple modes. HT studies are increasingly revealing unknown diversity in DNA-binding preferences of numerous proteins, many of which may be the results of different binding modes (29,72,75,81). Here, we classify variable binding modes into four categories (Figure 1) and briefly review each category.

Figure 1.

Multiple modes of DNA binding. Schematized are examples illustrating four mechanistic categories by which proteins recognize different DNA sites via distinct structural modes.

Variable spacing

Some proteins that bind to bipartite DNA motifs (i.e. motifs composed of two half-sites) can recognize distinct classes of motifs in which the half-sites are separated by different numbers of bases (Figure 1A). This phenomenon was first observed more than 20 years ago for basic leucine zipper (bZIP) proteins, which can bind overlapping or adjacent TGAC half-sites (82). Subsequent studies have classified the bZIP family into two subfamilies based on protein sequence: (i) Activator protein 1 (AP-1) proteins, which generally prefer overlapping half-sites, but bind to adjacent half-sites with almost equal affinity and (ii) Activating transcription factor (ATF)/cAMP response element-binding protein (CREB) proteins, which generally prefer adjacent half-sites and bind very poorly to overlapping half-sites (82,83). However, a recent PBM-based study found that at least one ATF/CREB protein in yeast (Yap3) binds with high affinity to both adjacent and overlapping half-sites (74). Therefore, although some of the residues that define bZIP half-site spacing have been identified (84), additional residues are also involved. Variable half-site spacing has also been well documented for the nuclear receptors (NRs). For example, both the peroxisome proliferator activated receptors (PPARs) and retinoic acid receptors (RARs) form heterodimers with the retinoid X receptor (RXR) and can bind to response elements composed of direct repeats of the 5′-AGGTCA-3′ half-site separated by different length spacers (85). PPAR:RXR dimers bind direct repeats spaced by 1 or 2 bases (i.e. DR1 = 5′-AGGTCANAGGTCA-3′; DR2 = 5′-AGGTCANNAGGTCA-3′), while RAR:RXR dimers bind to repeats spaced by 1 (DR1) or 5 (DR5) bases. The variable spacing has even been shown to affect in vivo function of DNA-bound NR dimers (86). RAR:RXR bound to DR5 elements can activate transcription in response to ligand, but will not activate transcription when bound to the DR1 elements (86).
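The DR1, DR2 and DR5 notation above can be made concrete with a short Python sketch that scans a sequence for two exact copies of the half-site separated by a spacer of 0 to 5 bases. This is a deliberately simplified illustration: the function name and the `max_spacer` parameter are assumptions, only the forward strand is searched, and only exact copies of 5′-AGGTCA-3′ are matched, whereas real response elements are degenerate.

```python
import re

HALF_SITE = "AGGTCA"


def find_direct_repeats(seq, max_spacer=5):
    """Return sorted (start, spacer_length) pairs for direct repeats of HALF_SITE.

    A spacer_length of n corresponds to a DRn element
    (e.g. 1 for DR1 = AGGTCA-N-AGGTCA).
    """
    hits = []
    for spacer in range(max_spacer + 1):
        # A lookahead is used so that overlapping elements are all reported.
        pattern = re.compile(
            f"(?={HALF_SITE}[ACGT]{{{spacer}}}{HALF_SITE})"
        )
        hits.extend((match.start(), spacer) for match in pattern.finditer(seq))
    return sorted(hits)
```

For example, `find_direct_repeats("AGGTCAAAGGTCA")` reports a single DR1 element starting at position 0. Classifying hits by spacer length in this way is what distinguishes, for instance, DR1 from DR5 sites for RAR:RXR in the discussion above.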

Multiple DBDs

DNA-binding proteins can contain multiple, independent DBDs (3,87). Multiple DBDs can allow a protein to alternatively recognize different DNA elements using different DBDs. For example, the zinc finger (ZF) protein Evi1 contains 10 ZF domains separated into two autonomously functioning DBDs (an N-terminal seven-ZF DBD and a C-terminal three-ZF DBD) that recognize completely different DNA motifs (88–90). A still more complicated scenario is seen for the mouse TF Oct-1, which has two DBDs known as the POU (Pit-Oct-Unc) homeodomain (POUHD) and the POU-specific domain (POUS) (Figure 1B) (24,91,92). Oct-1 can bind to three distinct DNA motifs using different combinations of the two DBDs: the POUHD site is recognized by the POUHD domain, the POUS site is recognized by the POUS domain and the composite POU site is recognized using both domains.

Multi-meric binding

Selective multimerization provides another means to expand or diversify the DNA-binding specificity of a protein. Proteins such as Oct-1 (93) and RXR (94) have been reported to bind DNA as either monomers or homodimers. Recent large-scale studies using HT-SELEX assays (39,95) have revealed that the dimeric mode of binding might be more common than previously appreciated, and even proteins that are known to bind DNA primarily as monomers, such as Elk1 (Figure 1C), can also form dimers with specific orientation and spacing preferences. These dimeric sites are enriched within genomic regions bound in vivo, suggesting that the dimeric profiles are biologically relevant. Furthermore, the dimer orientation and spacing preferences can sometimes distinguish individual members of the same TF family (95). For example, T-box factors share a common monomeric binding specificity but show seven distinct dimeric spacing/orientation preferences.

Alternate structural conformations

The recognition of distinct DNA motifs is expected when a protein contains multiple DBDs or binds as a multi-mer; however, some proteins with only a single DBD can also bind to multiple, distinctly different DNA sites. One such example is mouse TF sterol regulatory element-binding factor 1 (SREBF1) (Figure 1D), the ortholog of human TF sterol regulatory element-binding protein 1 (SREBP) (59,96). SREBF/SREBP proteins belong to the basic helix–loop–helix (bHLH) family of TFs that typically recognize symmetric E-boxes (5′-CAnnTG-3′). Unlike most bHLH proteins, SREBF/SREBP can bind the asymmetric sterol regulatory element 5′-ATCACnCCAC-3′. A co-crystal structure of the DNA-binding domain of human SREBP-1a bound to DNA revealed an asymmetric DNA–protein interface with one monomer binding the E-box half-site (5′-ATCAC-3′) using protein–DNA contacts typical of bHLH proteins and the second monomer recognizing the non-E-box half-site (5′-GTGGG-3′) using entirely different protein–DNA contacts (96,97). Thus, in the case of SREBF/SREBP, residues in the DBD are responsible for the multiple modes of DNA binding. Regions outside the DNA-binding domain of the protein can also enable alternate binding modes. For example, a region N-terminal to the basic DBD of yeast TF Hac1 is required for dual recognition modes. Mutations in this region were shown to preferentially reduce binding to one of the modes, whereas mutation of an arginine in the basic domain was crucial for the other mode of binding (98). This indicates that the protein can bind DNA using two distinct conformations, and individual residues (within and outside the DBD) play specific roles within each binding mode (98). The ability of proteins to utilize different DNA-binding modes presents difficulties or even precludes the construction of simple DNA-binding models for many proteins. The presence of distinct modes usually requires that multiple DNA-binding models are used to represent a protein’s specificity. 
This can cause difficulties when the data available to construct the DNA-binding model contains a mixture of multiple types of binding motifs, such as in genome-wide chromatin immunoprecipitation (ChIP)-seq datasets. In some situations, the DNA-binding modes can potentially be separated and independently characterized. For example, one could independently characterize the binding of individual DBDs when a protein contains multiple DBDs. However, even these situations can lead to difficulties as illustrated by the case of Oct-1 where the autonomous DBDs can function either independently or together, in which case examining them separately will cause one to miss the binding sites recognized when both DBDs are involved (Figure 1B).

Multi-protein recognition codes

The DNA-binding specificity of a TF is a primary determinant of where it will bind in the genome (99,100). However, transcriptional regulation often involves the assembly of multi-protein complexes on DNA (101,102), and it has been shown that multi-protein complexes can exhibit novel DNA-binding specificities not predictable from measurements of the individual proteins (40,48). Therefore, in efforts aimed at understanding the biophysical determinants of specificity in gene regulation, it is of primary importance to also consider how multi-protein complexes can lead to new or enhanced specificities. Here, we will review the varied mechanisms by which novel DNA-binding specificities can arise when proteins, both DNA-binding and non-DNA-binding, assemble together. We suggest that these mechanisms represent higher-order ‘multi-protein’ recognition codes: mechanisms by which genomic targeting of regulatory factors is encoded in multi-protein complexes.

Cooperative binding

Cooperative binding of TFs to DNA—usually via protein–protein interactions between adjacently bound TFs—stabilizes the proteins on the DNA and can enhance their individual contributions to transcription of a gene (3,103). Cooperative binding can also affect the DNA-binding specificity and lead to recognition of new binding sites (3) (Figure 2A). One way in which cooperative binding alters specificity is by extending the binding of a TF to include lower affinity sites as a result of the enhanced overall affinity of the cooperative complex for DNA. A well-studied example is the binding of yeast TFs MATα2, MATa1 and Mcm1, which regulate mating-type genes (103,104). MATα2 binds DNA cooperatively with cofactors MATa1 or Mcm1 to repress different sets of genes. MATα2:MATa1 heterodimers bind to sites found in the promoter regions of haploid-specific genes and repress their expression, whereas MATα2:Mcm1 heterotetramers bind different sites found in mating-type a-specific gene promoters and repress gene expression. Both complexes repress gene expression via recruitment of co-repressor proteins Tup1 and Ssn6 by interactions with MATα2. Therefore, MATα2 functions to recruit co-repressors and repress gene expression, but its DNA-binding specificity—and subsequently its target genes—are mediated by cooperative binding with cell-type-specific cofactors. In a recent study, a more subtle form of specificity alteration by cooperative binding was presented in which DNA-binding of one protein can stabilize or destabilize the DNA-binding of another protein via the deformation of the DNA structure (105). Notably, the cooperative binding was not mediated by direct protein contacts, but by an allosteric mechanism where DNA structural deformations propagate along the DNA helix. The cooperative allosteric effects were shown to operate even to ∼16 bp away.
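The way cooperativity can extend binding to lower affinity sites can be illustrated with the standard two-site equilibrium model, in which a cooperativity factor multiplies the statistical weight of the doubly bound state. The Python sketch below is a textbook-style simplification and not a model from the cited studies. The function and parameter names are assumptions, and any concentrations and association constants passed to it are hypothetical.

```python
def site_a_occupancy(conc_a, conc_b, k_a, k_b, omega=1.0):
    """Equilibrium probability that site A is occupied in a two-site system.

    conc_a, conc_b: free concentrations of proteins A and B (M).
    k_a, k_b: association constants of A and B for their own sites (1/M).
    omega: cooperativity factor; omega > 1 favors the doubly bound state,
           omega = 1 means the two proteins bind independently.
    """
    w_a = k_a * conc_a             # weight of "only A bound"
    w_b = k_b * conc_b             # weight of "only B bound"
    w_ab = omega * w_a * w_b       # weight of "both bound"
    partition = 1.0 + w_a + w_b + w_ab
    return (w_a + w_ab) / partition
```

With hypothetical values giving a weak site for A (`k_a * conc_a = 0.1`) and a strong site for B (`k_b * conc_b = 10`), the occupancy of site A is about 9% when binding is independent and rises to about 90% with `omega = 100`. In this simplified picture a site that would rarely be bound by A alone becomes a functional target only where the partner protein can bind nearby.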

Figure 2.

Multi-protein recognition codes. Schematized are examples illustrating four mechanistic categories by which targeting of proteins to distinct DNA sites involves the assembly of multi-protein complexes.

A second, more active, mechanism has been described where cooperative binding alters the residue–base contacts that a TF can make with the DNA, thereby altering its inherent binding specificity (3,40) (Figure 2A). One example is the binding of the TFs Ets-1 and Pax5, where Pax5 alters the DNA contacts made by Ets-1, making it more permissive to alternate DNA sequences (106). Bound alone to DNA, Ets-1 binds with high affinity to 5′-GGAA-3′ and with lower affinity to 5′-GGAG-3′. A tyrosine residue (Y395) in the Ets-1 recognition helix interacts with the fourth base position of this site and mediates the preference for adenine over guanine. However, in complex with Pax5, the Y395 residue adopts an alternate conformation and no longer interacts with the DNA at this base position. This has the effect of making the Ets-1 DNA binding permissive for both sequences, and therefore broadens the specificity of the Pax5:Ets-1 complex to include the suboptimal 5′-GGAG-3′ Ets-1 half-site. Another example has been elucidated in a series of studies on Drosophila Hox proteins and their cofactor Extradenticle (Exd) (40,107). The Hox protein Sex combs reduced (Scr) binds with the cofactor protein Exd preferentially to a paralog-specific site (fkh250), found in the fork head (fkh) gene, over a consensus site (fkh250con) recognized by other Hox:Exd complexes (107,108). The preferential binding of the Scr:Exd complex to the fkh250 site involves the stabilization of a flexible ‘arm’ region N-terminal to the Scr homeodomain (107). Two of the stabilized residues in the N-terminal arm region (Arg3 and His-12) insert into the fkh250 DNA minor groove, which is narrower than the same region in the fkh250con sequence and more electrostatically favorable for the Arg and His residues. Therefore, the enhanced specificity of the Scr:Exd complex for the fkh250 sequence is due to local DNA shape (i.e., minor groove width) readout by the Scr Arg and His residues when stabilized by the cooperatively bound Exd cofactor. This work highlights how the intrinsic specificity of a monomeric homeodomain (Scr) can be altered or refined through interaction with a protein cofactor (Exd), a characteristic called latent specificity. A recent comprehensive analysis of other Hox proteins extended these results and demonstrated latent specificities for all the Drosophila Hox proteins (40). A HT-SELEX/SELEX-seq assay was used to measure and compare the DNA-binding specificity of all eight Drosophila Hox proteins alone and in complex with Exd. As seen for Scr:Exd, binding with the Exd cofactor revealed latent specificity differences between the Hox proteins that were not observable when Hox proteins were examined as monomers.

Cooperative cofactor recruitment

The recruitment of non-DNA-binding cofactors (i.e. coactivators and corepressors) to promoters and enhancers by DNA-binding TFs is central to gene regulation (101). ‘Cooperative’ recruitment occurs when a cofactor can make simultaneous interactions with multiple, DNA-bound TFs; the result is a synergistic enhancement in the cofactor recruitment (109,110). Cooperative cofactor recruitment integrates the contributions from multiple TFs to achieve maximal gene expression and is a powerful mechanism for establishing an integrative, ‘AND-type’ regulatory logic in gene transcription (111–114). At the same time, cooperative recruitment enhances the binding specificity of the cofactor. The genomic location of the cofactor is determined not by the presence of a single TF, but by the less-frequent (i.e., more specific), coincident presence of multiple TFs (Figure 2B). A paradigm for cooperative cofactor recruitment are multi-protein, enhanceosome complexes in which multiple TFs cooperatively assemble on DNA and coordinately recruit a common cofactor. A well-studied example is the enhanceosome that binds to the Interferon beta (IFNβ) gene promoter and cooperatively recruits the coactivator CREB-binding protein (CBP) (111,112). In response to viral infection, the TFs ATF-2, c-Jun, Irf3, Irf7 and NF-κB, facilitated by the architectural factor Hmga1, bind cooperatively to the IFNβ promoter. Multiple proteins in the DNA-bound complex contribute to the cooperative recruitment of CBP, and operate synergistically to enhance transcription (111). While NF-κB, the ATF-2:c-Jun dimer and Irf factors can recruit CBP individually, studies have demonstrated that removal of the activation domains from IRF or NF-κB factors resulted in complete loss of CBP recruitment, highlighting the cooperative nature of the CBP cofactor recruitment (111). 
A similar situation occurs in the cooperative recruitment of class II, major histocompatibility complex transactivator (CIITA) to a set of conserved transcriptional control elements found in the promoters of major histocompatibility complex class II (MHC-II) genes (113,115). The control sequences contain four DNA elements termed the W, X, X2 and Y boxes, with a conserved sequence, spacing and orientation across the different promoters. The TFs X-box binding protein (RFX), X2-binding protein (X2BP) and Nuclear factor Y (NF-Y) bind to the X, X2 and Y boxes, respectively, and form a cooperative, nucleoprotein, enhanceosome complex. This multi-protein complex recruits the CIITA to the MHC-II promoters via multiple weak interactions mediated by different components of the enhanceosome complex (113). Removal of interactions from any of these component proteins leads to significant loss of CIITA recruitment.
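The way several weak contacts can add up to specific recruitment can be illustrated with a toy equilibrium model. The sketch below is an illustration only and is not taken from the studies cited above. It assumes that each DNA-bound TF contributes an independent, additive contact free energy (the hypothetical parameter `dg_per_contact`, in kcal/mol) and that cofactor occupancy follows a simple two-state binding isotherm. The function name and all numerical values are invented for the example.

```python
import math

RT = 0.593  # kcal/mol, approximately room temperature


def cofactor_occupancy(n_contacts, dg_per_contact=-1.5, conc_over_kd0=0.01):
    """Fractional occupancy of a cofactor at a site with n_contacts bound TFs.

    Assumptions (illustrative, not from the source):
      * every TF contact adds dg_per_contact (kcal/mol) to the binding energy;
      * conc_over_kd0 is the cofactor concentration divided by a hypothetical
        reference dissociation constant for a site offering no TF contacts.
    """
    # Each additional contact multiplies the effective affinity by exp(-dG/RT).
    boost = math.exp(-n_contacts * dg_per_contact / RT)
    x = conc_over_kd0 * boost
    return x / (1.0 + x)


if __name__ == "__main__":
    for n in range(5):
        print(n, "TF contacts -> occupancy", round(cofactor_occupancy(n), 3))
```

Under these made-up parameters, a site offering a single TF contact is occupied only weakly, whereas a site offering three or four coincident contacts is close to saturated. This is the AND-type behaviour described above: occupancy depends on the coincident presence of multiple TFs, not on any one of them.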

Allosteric effects in cofactor recruitment

DNA can act as a sequence-specific allosteric ligand that alters the function of a DNA-bound TF (116–121). A common mechanism is through DNA sequence-dependent effects on cofactor recruitment. A TF bound to one set of DNA sites will recruit a specific cofactor, whereas the same TF bound to a different set of sites will not recruit the cofactor. Therefore, allosteric control of cofactor recruitment functionally partitions the DNA binding sites of a TF and consequently refines the binding specificity of the recruited cofactor (Figure 2C). Allosteric mechanisms have been reported for both the glucocorticoid (GR) and the estrogen (ER) nuclear receptors (116,118). The DNA sequence bound by ER can influence its transcriptional activity and recruitment of LXXLL-containing peptides and p160 cofactors (SRC-1, GRIP1, ACTR) (116). The DNA sequence bound by GR, called the GR response element (GRE), was shown to influence gene expression, but the effect did not correlate with GR-binding affinity, suggesting a mechanism that involves differential recruitment of cofactor proteins (118). Furthermore, knockdown of Brahma, a subunit of the SWI/SNF nucleosome remodeling complex, affected GR-dependent gene expression in a GRE sequence-dependent manner, suggesting that the composition of the DNA-bound protein complexes was allosterically regulated. Allosteric control of cofactor recruitment has also been described for different NF-κB family dimers (117,120). Single-base changes in NF-κB binding sites have been demonstrated to affect recruitment of the TF Irf3 by DNA-bound p65/Rel-containing dimers (117) as well as the recruitment of the cofactors histone deacetylase 3 (HDAC3) and Tip60 to a DNA-bound, ternary complex containing p50 homodimers and Bcl3 (120). The mechanism of allosteric control for GR and NF-κB appears to be mediated by structural differences between TFs bound to different DNA sites.
Structural studies of GR bound to different GREs demonstrated that the structure of a ‘lever arm’ loop region, connecting the DNA recognition helix and the ligand binding domain, was highly dependent on the GRE sequence (118). Studies of NF-κB dimers bound to different κB sites have shown that DNA sequence differences can result in structural rearrangements (rotations and translations) of one dimer subunit (19,21,122). Protease protection assays and detailed binding studies have also suggested DNA sequence-dependent changes in protein conformation (72,123). In contrast to the situation for GR and NF-κB, where allostery is mediated by structural differences between the DNA-bound complexes, allostery involving the POU factors operates via larger, more global differences in quaternary structure when bound to different DNA sites. POU factors, such as Oct-1 described above, have two DBDs connected by a flexible linker: a POU-specific domain (POUS) and a POU homeodomain (POUHD) (121). POU factors can homo- and heterodimerize on a diverse set of DNA-binding sites in which the relative orientations and spacing of the POUS and POUHD can vary and subsequently lead to differential cofactor recruitment (121,124). The POU factors Oct-1 and Oct-2 can bind as homodimers to two distinct response elements, PORE and MORE, found in the promoters of B-cell-specific genes (121). When bound to the PORE sequence, Oct dimers can recruit the transcriptional coactivator Oct-binding factor 1 (OBF-1); however, dimers bound to the MORE sequence do not recruit OBF-1. Crystal structures of the DNA-bound Oct-1 dimer have revealed that in the MORE-bound dimer the OBF-1 binding site on Oct-1 is blocked due to the relative orientation of the dimer subunits, but in the PORE-bound dimer the site is exposed, allowing OBF-1 recruitment (125,126).
A similar situation arises with the Pit-1 POU factor in which binding site differences lead to cell-type-specific expression patterns and differential recruitment of the N-CoR co-repressor complex (124,127,128). Studies revealed that an extra 2 bp between the POUS and POUHD half-sites found in the growth hormone (GH) gene promoter, compared with a similar affinity site in the prolactin gene promoter, altered the quaternary structure of the DNA-bound Pit-1 and enabled the recruitment of the N-CoR co-repressor complex to the GH site (124).

Cofactor-based specificity

Targeting of non-DNA-binding cofactors to DNA is traditionally viewed as being mediated solely via protein–protein interactions with DNA-bound TFs. In other words, the cofactor is ‘recruited’ and interacts with DNA indirectly through the TFs. However, several studies have described a provocative alternative mechanism in which recruited cofactors interact directly with DNA when part of a larger, multi-protein complex (48,129). Further, as part of this larger complex, the non-DNA-binding cofactors mediate sequence-specific interactions that preferentially stabilize the binding to composite DNA sites containing specific auxiliary motifs (Figure 2D). This represents yet another mechanism to enhance the DNA sequence specificity of multi-protein regulatory complexes. The regulation of sulfur metabolism genes in yeast involves a multi-protein complex composed of a sequence-specific TF Cbf1 and two non-DNA-binding cofactors Met28 and Met4 (130,131). Cbf1 is a bHLH protein and binds as a homodimer to E-box sites with a consensus 5′-CACGTG-3′ core (131). A PBM-based analysis revealed that the Met4:Met28:Cbf1 complex binds preferentially to composite DNA sites in which the Cbf1 E-box is flanked by an adjacent 5′-RYAAT-3′ ‘recruitment’ motif (i.e. 5′-RYAATNNCACGTG-3′) (48). High-affinity binding to the composite sites is highly cooperative and requires the full heteromeric complex. Sequence analysis, modeling and binding assays suggest that the recruitment motif is recognized by the non-DNA-binding Met28 subunit. A highly similar situation has also been described for the multi-protein Oct-1:HCF-1:VP16 complex that regulates expression of the herpes simplex virus immediate-early genes by binding to a composite 5′-TAATGARAT sequence in the gene promoters (129). The complex is composed of the DNA-binding TF Oct-1 and two non-DNA-binding cofactors: the viral activator VP16 and the host cell factor 1 (HCF-1). 
Oct-1 binds to the 5′-TAAT sequence and a DNA-binding surface of VP16 mediates interactions with the 5′-GARAT sequence. Similar to the situation in yeast, the VP16 cofactor does not interact well with DNA on its own, but when stabilized on DNA as part of a larger multi-protein complex it can adopt a configuration that mediates recognition of the 5′-GARAT subsequence. Therefore, in both situations multi-protein complexes preferentially bind to composite binding sites composed of a recruitment motif (5′-RYAAT-3′ or 5′-GARAT-3′) and an adjacent TF-binding motif (5′-CACGTG-3′ or 5′-TAAT-3′), which are recognized by a non-DNA-binding cofactor subunit (Met28 or VP16) and a sequence-specific TF subunit (Cbf1 or Oct-1), respectively.
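The composite sites described above are written in IUPAC degenerate notation (R = A or G, Y = C or T, N = any base), so they can be located in a sequence by simple pattern matching. The following is a minimal sketch, assuming uppercase A/C/G/T input. The helper names (`iupac_to_regex`, `revcomp`, `find_sites`) and the example sequence in the usage note are hypothetical and are not part of any published pipeline.

```python
import re

# Subset of IUPAC nucleotide codes needed for the motifs discussed in the text.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "N": "[ACGT]",
}


def iupac_to_regex(motif):
    """Translate an IUPAC motif such as 'RYAATNNCACGTG' into a regex string."""
    return "".join(IUPAC[base] for base in motif)


def revcomp(seq):
    """Reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_sites(seq, motif):
    """Return (start, strand, match) for every motif match on either strand.

    A lookahead is used so that overlapping matches are reported.
    Start coordinates are 0-based and always refer to the forward strand.
    """
    pattern = re.compile("(?=(" + iupac_to_regex(motif) + "))")
    hits = []
    for strand, s in (("+", seq), ("-", revcomp(seq))):
        for m in pattern.finditer(s):
            if strand == "+":
                start = m.start()
            else:
                start = len(seq) - m.start() - len(motif)
            hits.append((start, strand, m.group(1)))
    return hits
```

For example, `find_sites("TTGCAATGGCACGTGAA", "RYAATNNCACGTG")` reports a single forward-strand composite site, whereas scanning the same made-up sequence for the palindromic E-box core `"CACGTG"` alone reports the core on both strands. A pattern match of this kind only flags candidate sites; it says nothing about binding affinity.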

SUMMARY

Complexities in DNA-binding operate at multiple levels and complicate efforts to construct simple models, or recognition codes, of DNA-binding. Flexible protein-DNA interactions and non-local structural effects confound simple descriptions of DNA base preferences, and require the use of binding models, such as PWMs, that can account for the resulting sequence degeneracy. The ability of proteins to bind DNA via multiple modes further complicates the situation and leads to the requirement of multiple binding models for many proteins. Fortunately, HT technologies are not only increasing the rate at which DNA-binding proteins are being characterized, but are providing the comprehensive binding data needed to construct models that involve multiple DNA-binding modes (23,24,27,28,35,38–40,42,43,48,59,73,74,79,132–134). An added benefit of the comprehensive nature of certain HT datasets, such as comprehensive k-mer or binding site level data, is that predictions of binding sites in the genome can be performed directly using the measured binding data (23,24,40,43,133). Multi-protein complexes add tremendously to the DNA-binding diversity of proteins and provide a key mechanism to integrate signals in gene regulation. However, these higher-order multi-protein mechanisms cause difficulties in constructing models, primarily due to the fact that DNA-binding models of proteins examined in isolation may not capture the binding sites utilized in vivo when cofactors are present. Here again, HT technologies are being used successfully to characterize binding of multi-protein complexes, revealing recognition codes mediated by cooperative binding (40,133) and cofactor-mediated targeting (48). The continued application of HT technologies to examine DNA binding of multi-protein complexes will undoubtedly provide increasingly refined binding models and provide new insights into gene targeting in vivo. 
Furthermore, just as the application of HT technologies has drawn our attention to the prevalence of TFs that exhibit subtle binding preferences or bind DNA via multiple modes, studies examining multi-protein complexes will likely identify widespread use of higher-order mechanisms such as allostery, cooperative recruitment and cofactor-mediated targeting. Integrating these datasets with whole-genome chromatin immunoprecipitation (ChIP) and expression datasets will lead to much more complete and sophisticated descriptions of specificity in gene regulation.
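As a concrete reminder of what a PWM-type binding model does, the sketch below builds a log-odds matrix from a count matrix and scores every window of a sequence. It is a minimal illustration under stated assumptions: the counts are invented to resemble an E-box-like site, the background is uniform, and the function names (`log_odds_pwm`, `scan`, `best_site`) are hypothetical. Because a PWM scores each position independently, a single matrix of this kind cannot by itself represent multiple binding modes or cofactor-dependent changes in specificity.

```python
import math

# Hypothetical position count matrix (one dict per motif position).
# The counts are made up for illustration and loosely resemble 5'-CACGTG-3'.
COUNTS = [
    {"A": 1, "C": 17, "G": 1, "T": 1},
    {"A": 16, "C": 1, "G": 2, "T": 1},
    {"A": 1, "C": 15, "G": 1, "T": 3},
    {"A": 3, "C": 1, "G": 15, "T": 1},
    {"A": 1, "C": 2, "G": 1, "T": 16},
    {"A": 1, "C": 1, "G": 17, "T": 1},
]


def log_odds_pwm(counts, background=0.25, pseudocount=1.0):
    """Convert counts to log2 odds against a uniform background (assumed)."""
    pwm = []
    for col in counts:
        total = sum(col.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2((col[base] + pseudocount) / total / background)
            for base in "ACGT"
        })
    return pwm


def scan(seq, pwm):
    """Score every window of an uppercase A/C/G/T sequence; positions add."""
    width = len(pwm)
    return [
        (i, sum(pwm[j][seq[i + j]] for j in range(width)))
        for i in range(len(seq) - width + 1)
    ]


def best_site(seq, pwm):
    """Return (start, score) of the highest-scoring window."""
    return max(scan(seq, pwm), key=lambda hit: hit[1])
```

A typical call is `best_site(sequence, log_odds_pwm(COUNTS))`. Comprehensive k-mer data of the kind mentioned above could be used instead by replacing the per-position sum with a direct lookup of each window's measured score.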

FUNDING

NIH [K22AI093793 to T.S.]; PhRMA Foundation Research Starter Grant (to R.G.); Basil O'Connor Research Award #5-FY13-212 from March of Dimes Foundation (to R.G.). Funding for open access charge: Waived by Oxford University Press. Conflict of interest statement. None declared.
