Motif grammar: The basis of the language of gene expression.

Abstract

Collaboration of transcription factors (TFs) and their recognition motifs in DNA is the result of coevolution and forms the basis of gene regulation. However, how these short genomic sequences contribute to setting the level of gene products is not understood in sufficient detail. The biological problem to be solved by the cell is complex, because each gene requires a unique regulatory network in each cellular condition using the same genome. Thus far, only some components of these networks have been uncovered. In this review, we compiled the features and principles of the motif grammar, which dictates the characteristics and thus the likelihood of the interactions of the binding TFs and their coregulators. We present how sequence features provide specificity using, as examples, two major TF superfamilies, the bZIP proteins and nuclear receptors. We also discuss the phenomenon of "weak" (low affinity) binding sites, which appear to be components of several important genomic regulatory regions, but paradoxically are barely detectable by the currently used approaches. Assembling the complete set of regulatory regions composed of both weak and strong binding sites will yield more comprehensive lists of factors playing roles in gene regulation, thus making possible a deeper understanding of regulatory networks.

Keywords: Basic leucine zipper; Motif grammar; Nuclear receptor; Transcription factor; Weak motif

Year: 2020 PMID: 32802274 PMCID: PMC7406977 DOI: 10.1016/j.csbj.2020.07.007

Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271

Introduction

DNA binding by transcription factors (TFs) takes place at recognition sites that meet their specific sequence requirements (motifs) throughout the genome. However, there is an incredible redundancy of these sites for a number of reasons. DNA sequence motifs generally contain both strictly invariable and degenerate nucleotides. The latter means that in certain positions two, three, or even four different nucleotides can be tolerated by the binding TF(s), without necessarily changing the binding affinity. As a result, a motif with four invariable nucleotides and two interchangeable ones ((1/4)^4 × (2/4)^2 = 1/1024) or three invariable and four interchangeable ones ((1/4)^3 × (2/4)^4 = 1/1024) can be found in each kilobase of a genome by chance. In a typical mammalian genome, this can result in millions of putative binding sites per TF. However, most of these are hidden in the chromatin structure, so their accessibility is key to establishing specific DNA-protein interactions. Therefore, out of the millions of putative binding sites, only a fraction (hundreds to tens of thousands) is indeed available and occupied in the individual cells at a certain moment and condition, and even fewer have direct functional consequences. The motifs characterizing these binding sites evolved along with the TFs, especially with their DNA binding domains (DBDs), to provide specific contacts during the course of phylogenesis and ontogenesis. Accordingly, members of most major TF (super)families, such as the basic leucine zipper (bZIP), homeodomain, and high mobility group (HMG) proteins, nuclear receptors (NRs), as well as the TATA-binding protein (TBP) and CCAAT-binding complex (CBC), can be found in all metazoans, and except for NRs, also in lower eukaryotes [1], [2], [3], [4], [5], [6].
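
The chance-occurrence arithmetic above can be checked with a short calculation. This is a minimal sketch that assumes independent, equally frequent nucleotides and, for the genome-wide estimate, a genome of about 3 × 10^9 bp scanned on one strand; the function names and the genome size are illustrative assumptions, not values taken from the review.

```python
from fractions import Fraction


def chance_probability(n_invariable: int, n_twofold: int) -> Fraction:
    """Probability that a random position matches a motif.

    Assumes independent, equiprobable nucleotides (a simplification):
    an invariable position matches with probability 1/4, a position that
    tolerates two nucleotides matches with probability 2/4.
    """
    return Fraction(1, 4) ** n_invariable * Fraction(2, 4) ** n_twofold


def expected_sites(genome_bp: int, n_invariable: int, n_twofold: int) -> float:
    """Expected number of chance matches on one strand (edge effects ignored)."""
    return float(genome_bp * chance_probability(n_invariable, n_twofold))


# The two motif architectures mentioned in the text.
p_four_plus_two = chance_probability(4, 2)    # (1/4)^4 * (2/4)^2
p_three_plus_four = chance_probability(3, 4)  # (1/4)^3 * (2/4)^4

# Hypothetical genome size of 3e9 bp (an assumption for illustration only).
genome_wide = expected_sites(3_000_000_000, 4, 2)

print(p_four_plus_two, p_three_plus_four, genome_wide)
```

Both architectures give a probability of 1/1024, that is, roughly one chance match per kilobase, and a genome of this assumed size would carry on the order of millions of such matches, in line with the estimate in the text.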

Interaction between DNA and transcription factors

Beyond sequence-specific binding in the major groove, multiple factors contribute to DNA-TF interactions, including the binding of the sugar-phosphate backbone or of the base pairs from the side of the minor groove; the latter allows an additional, basically binary sequence readout [7]. Interferon regulatory factors (IRFs) and certain NRs show dual sequence recognition, as these bind in both the major and minor grooves at the same time [8], [9]. TBP, CBC, and HMG protein family members, in contrast, bind primarily in the minor groove; however, their recognition sequences allow the specific bending of DNA, which is critical for the interaction and downstream functions [5], [6], [10]. As in these cases, the composition of motifs determines the possible local conformations of DNA, which can fit the interaction surfaces of DNA-binding proteins. These so-called shape motifs, which can be achieved even by diverse sequences, imply an additional layer of specificity for sequence motifs [7], [11], [12], [13], [14]. Similarly, DNA methylation also affects DNA-protein contacts. The binding of methylated GC-rich sequences by TFs is generally greatly limited [15]. For instance, interaction between CCCTC-binding factor (CTCF) and the insulator elements is hindered by DNA methylation [16], [17]. In contrast, the repressor Kaiso shows DNA methylation-dependent binding [18], [19].

The hierarchy of binding sites: From monomer binding sites to enhancers

TFs work in collaboration with other TFs and non-DNA-binding coregulators, forming multi-protein complexes and establishing the connection between promoters and enhancers/silencers to regulate gene expression. Promoters, by definition, contain motifs that facilitate the recruitment of general TFs and RNA polymerase, resulting in basal gene expression levels. Enhancers, in turn, contain cell type-specific motifs, which recruit TFs responsible for the induction of phenotype-determining genes from a distance via looping. Silencers have the opposite effect to enhancers, although this is due to the binding TFs, which can take part in gene regulation as activators, repressors, or collaborators. As a result, the same sequence can behave either as an enhancer or silencer depending on the cellular context [20]. For simplicity, we will refer to all these, typically promoter-distal regulatory regions as enhancers. Genes are regulated by various numbers of TF binding sites. Conserved gene regulatory regions, such as promoters and certain enhancers, can be hundreds of nucleotides long, including dozens of binding sites, while most enhancers have evolved recently and consist of a few or even one single binding site [21]. Several TFs are functional as monomers and interact with single monomer binding sites with high affinity, although these should be long enough – usually at least six invariable nucleotides – to provide a specific surface for the required amount of molecular contacts. Most TFs bind DNA as dimers, and these can be further assembled to larger complexes capable of DNA binding at multiple surfaces. In line with these, monomer and dimer binding sites can be arranged into more complex elements, and ultimately, add up to collaborating promoters and enhancers.
Within these functional sequences, monomer binding sites can be located in several ways relative to each other; however, their distribution is not fully random – there are regularities in the arrangement of elements that make possible the interactions with the TFs appropriate for cell type-specific gene regulation. Moreover, the joint recruitment of multiple TFs can make suboptimal binding sites accessible, thus enhancers (and promoters) can contain weak binding sites [22]. For instance, several developmental genes are regulated by both optimal and suboptimal binding sites with an optimized relative distribution, while experimental improvement of weak binding sites causes ectopic gene expression [23], [24], [25], [26].

Transcription factors and composite elements

Most TFs form dimers, in which both proteins are capable of DNA binding at dimer binding sites, thus creating the possibility or further increasing the specificity of DNA-protein interactions (Fig. 1). Dimerization can take place with the involvement of two identical TFs (homodimers), TFs from the same (super)family, or even from families of different origin (heterodimers) [7], [27], [28], [29]. For instance, bZIP proteins have an integrated DNA-binding and dimerization domain, which enables a flexible choice of dimerizing partners within the superfamily and specific motif recognition depending on the partners [1], [30], [31], [32]. Steroid hormone receptors and dimeric orphan receptors (NR superfamily) form homodimers or heterodimers with NRs from the same family, while NRs from most NR1 families (NR1A, B, C, H, I) form heterodimers with the retinoid X receptors (RXRs, NR2B) [2], [33]. In this superfamily, two zinc fingers provide sequence-specific recognition and also contribute to dimerization.

Fig. 1

Classification of transcription factor binding sites. Monomer binding sites, which can be bound by single TFs, often add up to larger units. Two, essentially identical half-sites can be bound by homodimers formed by identical TFs (blue circles, right) or heterodimers formed by related TF partners (blue and green circle). Composite elements are built up from at least two monomer binding sites specific for different TFs (blue and dark purple circles). Boxes colored according to TFs represent monomer binding sites. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Although most known dimers are assembled from closely related TFs, there is a growing number of newly identified heterodimers built up from TFs from different protein (super)families [28], [29]. While the former group of dimers binds two associated, substantially identical monomer binding sites – so-called half-sites –, the latter binds composite elements of different monomer binding sites (Fig. 1). The protein product of myeloid ecotropic viral integration site 1 (MEIS1), for example, binds several composite elements in collaboration with members of other homeodomain protein families [27], [34]. It also has a homodimer binding site; however, the DBDs within the supposed homodimer are on the opposite side of DNA, so there is no contact between them. In contrast, in the case of MEIS1/distal-less homeobox 3 (DLX3) heterodimers, the DBD of DLX3 distorts DNA, narrows and binds the minor groove, and then it is capable of interacting more closely with the DBD of MEIS1 [27]. Both examples show contacts that are primarily independent of TF-TF interactions and suggest that dimer binding sites can have major roles both in TF binding and dimerization, although other TFs, like bZIP proteins, have interaction surfaces large and compatible enough to provide dimerization even in the absence of specific DNA segments.
There are also long known heterodimers and composite elements with critical developmental and cell lineage-determining roles. Collaboration of octamer-binding transcription factor 4 (OCT4, homeodomain) and sex determining region Y (SRY)-box 2 (SOX2, HMG) on their shared composite elements is a key for the maintenance of embryonic stem cells [6], [35]. Purine-rich nucleic acid binding protein 1 (PU.1; erythroblast transformation-specific [ETS] superfamily), in turn, is indispensable for the development and maintenance of most white blood cells and tightly collaborates with IRF4 or IRF8 on several kinds of composite elements [36], [37]. Within these elements, the core nucleotide tetramers tandemly follow each other, but their order and spacing are different. The ETS:IRF composite element (EICE) with two spacer nucleotides and ETS:IRF response element (EIRE) with three spacer nucleotides are bound by PU.1 and IRF4/8 in this order, while the IRF:ETS composite sequence (IECS) with two or three spacer nucleotides is bound in the opposite order by the two proteins. These motifs imply different ways of dimerization, although the transcriptional effects of different conformations of the formed ternary complexes are not known [38], [39], [40], [41]. There are additional ETS proteins that form heterodimers with TFs of different origins and have composite elements accordingly. Out of these, several elements, such as that of the glial cells missing transcription factor 1 (GCM1)/ETS-like 1 (ELK1) heterodimer, lack flanking (spacer) nucleotides or contain altered ones [27]. Thereby, in certain cases, the knowledge of canonical monomer binding sites is not sufficient to cover all elements in use, but also dimer-specific information should be used to cover all binding sites (the cistrome) of a certain TF.

Motif grammar

Monomer binding sites within longer gene regulatory units can follow each other in several ways to provide specificity and selectivity. In the simplest case, half-sites can form an asymmetric head-to-tail (tandem or direct repeat, DR) or symmetric (inverted or everted) repeat (IR or ER, respectively); in addition, the distance between them can also vary. For instance, within the signal transducer and activator of transcription (STAT) family all dimers are capable of binding the quasi-palindromic, interferon γ-activated sequence (GAS) with three spacer nucleotides (5′-TTC(T/C)N(A/G)GAA-3′), while STAT6 homodimers prefer four nucleotide-long spacers [42], [43], [44]. In the case of composite elements and higher-order regulatory sequences, besides the orientation and spacing, the type, number, and order of monomer binding sites also have significance, not to mention the strength (affinity) of the individual binding sites and the shape information encoded in DNA (Table 1). These sequence features that determine the quality, order, orientation, and putative interactions of the binding TFs and their coregulators are termed motif grammar, or promoter/enhancer grammar, if we consider entire regulatory regions [23], [45]. To illustrate how sequence features determine specific DNA-TF interactions, we discuss in the next sections the basic sequence requirements of two major TF groups, the bZIP and NR superfamilies, involved in important physiological and pathological processes [1], [2]. Furthermore, we listed in Table 2 all elements of this review, mentioned in connection with motif grammar.
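
How spacing alone separates two otherwise identical half-site arrangements can be illustrated with a small pattern search. The sketch below transcribes the GAS consensus quoted above into a regular expression; the four-nucleotide-spacer pattern is an assumption, because the text gives only the spacer length preferred by STAT6 homodimers, not its composition. The function name and the example sequence are hypothetical.

```python
import re

# Transcribed from the consensus in the text: 5'-TTC(T/C)N(A/G)GAA-3'
GAS_N3 = "TTC[TC][ACGT][AG]GAA"
# Assumption: only the spacer length (four nucleotides) is stated for STAT6,
# so any nucleotides are accepted between the TTC and GAA half-sites.
STAT6_N4 = "TTC[ACGT]{4}GAA"


def find_sites(seq: str, pattern: str) -> list[tuple[int, str]]:
    """Return (0-based start, matched sequence) pairs, overlaps included.

    A lookahead is used so that overlapping occurrences are not skipped.
    """
    lookahead = re.compile(f"(?=({pattern}))")
    return [(m.start(), m.group(1)) for m in lookahead.finditer(seq.upper())]


# Made-up sequence carrying one three-spacer and one four-spacer site.
example = "AATTCTGAGAATTGGTTCTAAGGAACC"

print(find_sites(example, GAS_N3))
print(find_sites(example, STAT6_N4))
```

Exact consensus matching of this kind finds only strong sites; it says nothing about affinity, shape, or the weak sites discussed elsewhere in this review.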

Table 1

Features of motif and enhancer (promoter) grammar.

	Motif grammar	Motif grammar	Motif grammar	Enhancer grammar
Feature	Monomer binding site	Dimer binding site	Composite element	Cluster of elements
Gene regulation	?	?	?	+
Size (bp)	3–15	6–20	10–25	6–hundreds
Number of binding sites	1	2	2–3	1–tens
Type	–	–/+	+	+
Order	–	–/+	+	+
Orientation	–	+	+	+
Spacing	–	+	+	+
Strength	+	+	+	+
Shape	+	+	+	+

Sequence features that contribute (+) or do not contribute (−) to motif/enhancer complexity (specificity) are indicated. Homodimer binding sites, for instance, contain basically identical monomer binding sites (half-sites), so their orientation, spacing, strength, and shape can vary (+), but their type and order are self-evident (−), while in other dimer binding sites the type and order can also be determinate features (+). Unlike in the case of enhancers (promoters), the effect of individual elements on gene expression is uncertain (‘?’).

Table 2

Summary of representative transcription factors and their binding sites.

Name of transcription factors (element)	Citation(s)
MEIS1/MEIS1	Jolma et al. 2015 [27]
MEIS1/DLX3	Jolma et al. 2015 [27]
OCT4/SOX2	Rodda et al. 2005 [35]
IRF/IRF (ISRE)	Fujii et al. 1999 [9]
PU.1/IRF4/8 (EICE/EIRE)	Meraro et al. 2002 [38]
IRF4/8/PU.1 (IECS)	Tamura et al. 2005 [39]
GCM1/ELK1	Jolma et al. 2015 [27]
STAT/STAT (GAS)	Pearse et al. 1993 [42], Seidel et al. 1995 [43]
STAT6/STAT6	Li et al. 2016 [44]
JUN/FOS (TRE)	Deppmann et al. 2006 [46], Amoutzias et al. 2007 [1], Cohen et al. 2018 [31]
CREB, ATF, JUN dimers (CRE)	Deppmann et al. 2006 [46], Amoutzias et al. 2007 [1], Cohen et al. 2018 [31]
sMAF/CNC (MARE)	Inamdar et al. 1996 [47], Newman et al. 2003 [30]
lMAF/lMAF (MARE)	Kataoka et al. 1994 [49], Newman et al. 2003 [30], Kurokawa et al. 2009 [48]
C/EBP/C/EBP	Cohen et al. 2018 [31]
C/EBP/ATF4 (CARE)	Cohen et al. 2018 [31]
JUNB/BATF/IRF4/8 (AICE)	Glasmacher et al. 2012 [51]
NR3C dimers (IR3)	Mangelsdorf et al. 1995 [2]
ER/ER (ERE)	Mangelsdorf et al. 1995 [2]
RAR/RXR (RARE, DR0-2,5,8, IR0)	Rastinejad et al. 1995 [59], Moutier et al. 2012 [53], Simandi et al. 2018 [52]
PPAR/RXR (PPRE)	IJpenberg et al. 1997 [8], Chandra et al. 2008 [55], Nagy et al. 2020 [67]
REV-ERB/REV-ERB (Rev-DR2)	IJpenberg et al. 1997 [8], Sierk et al. 2001 [54]
RXR/LXR (LXRE)	Feldmann et al. 2013 [58], Lou et al. 2014 [57]
RXR/VDR (VDRE)	Rastinejad et al. 1995 [59], Orlov et al. 2012 [60]
RXR/THR (THRE)	Rastinejad et al. 1995 [59], Grøntved et al. 2015 [61]
ROR (RORE)	IJpenberg et al. 1997 [8]
NR3B	Johnston et al. 1997 [65]
NR4A	Wilson et al. 1992 [64]
NR5A	Lala et al. 1992 [63]

MEIS1, myeloid ecotropic viral integration site 1; DLX3, distal-less homeobox 3; OCT4, octamer-binding transcription factor 4; SOX2, sex determining region Y (SRY)-box 2; IRF, interferon regulatory factor; ISRE, interferon-stimulated response element; PU.1, purine-rich nucleic acid binding protein 1; ETS, erythroblast transformation-specific; EICE/EIRE, ETS:IRF composite/response element; IECS, IRF:ETS composite sequence; GCM1, glial cells missing transcription factor 1; ELK1, ETS-like 1; STAT, signal transducer and activator of transcription; GAS, interferon γ-activated sequence; RARE, retinoic acid response element; further abbreviations for bZIP and NR proteins are listed in Fig. 2, Fig. 3.

Fig. 2

Schematic representation of major bZIP motifs. TRE/CRE (5′-TGA(C/G)-3′) and MARE (5′-TGCTGA(C/G)-3′) half-sites are marked by blue arrows, C/EBP half sites are marked by green arrows, and spacer nucleotides are marked by black dots. Schematic motif logos and motif/protein names are indicated (TRE, TPA response element; CRE, cAMP response element; CREB, CRE binding protein; ATF, activating transcription factor; MAF, musculoaponeurotic fibrosarcoma protein; CNC, cap′n′collar-type bZIP protein; C/EBP, CCAAT/enhancer-binding protein; (s)MARE, (small) MAF response element; CARE, C/EBP:ATF response element). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 3

Schematic representation of nuclear receptor motifs. The general (5′-(A/G)GGTCA-3′) and the NR3C steroid hormone receptor-specific (5′-AGAACA-3′) half-sites are marked by blue or green arrows, respectively. 5′ extensions are marked by cyan arrows, and spacer nucleotides are marked by black dots. Schematic motif logos and motif/protein names are indicated (IR, inverted repeat; DR, direct repeat; ROR, retinoic acid receptor (RAR)-related orphan receptor; PPAR, peroxisome proliferator (PP)-activated receptor; RXR, retinoid X receptor; VDR, vitamin D receptor; THR, thyroid hormone receptor; LXR, liver X receptor; ERE, estrogen response element; RORE, ROR response element; PPRE, PP response element; VDRE, vitamin D response element; THRE, thyroid hormone response element; LXRE, LXR response element). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Motif grammar of bZIP proteins

In line with the symmetric basic structure of bZIP dimers, the direction of bZIP half-sites is always convergent, and the spacer appears to be an integral part of the half-sites (Fig. 2). The most ancient bZIP motifs share the same half-site and differ in spacer length [1]: the 12-O-tetradecanoylphorbol-13-acetate (TPA) response element (TRE) contains a one-nucleotide long spacer (5′-TGA(C/G)TCA-3′), while the cAMP-response element (CRE) has two spacer nucleotides (5′-TGACGTCA-3′). The former is optimal for members of the activator protein 1 (AP-1) families, primarily FOS/JUN heterodimers; and the latter can be bound by dimers of CRE binding protein (CREB)/activating transcription factor (ATF) family members (Fig. 2). Nevertheless, JUN proteins are capable of binding both TRE and CRE by forming heterodimers with different partners [46]. Certain bZIP dimers are specialized for longer sequences. Musculoaponeurotic fibrosarcoma (MAF) proteins, for example, recognize MAF response elements (MAREs) with 5′-TGCTGA(C/G)-3′ half-sites, which can be considered as upstream extended TRE/CRE half-sites [47], [48]. Small MAFs form heterodimers basically with Cap’n’collar (CNC)-type bZIP proteins, which, in turn, also bind the short 5′-TGA(C/G)-3′ half-site, so their shared composite element is essentially a TRE, extended by three nucleotides in one direction. In contrast, dimers of large MAFs bind the inverted repeat of MARE half-sites with a single C/G nucleotide or both in the middle (Fig. 2) [49].
There are additional bZIP proteins that recognize TRE/CRE half-sites and are capable of interacting with bZIP proteins with different DNA binding features [46]. Several ATF proteins form heterodimers with CCAAT/enhancer-binding proteins (C/EBPs) and can have a composite element other than the merge of the two canonical half-sites. Within the ATF4-specific C/EBP:ATF response element (CARE), ATF4 binds a half-site with a unique spacer (5′-TGA(T)-3′), while its partner binds the canonical C/EBP half-site (5′-TT(GC)-3′) (Fig. 2) [31]. Interestingly, the C/EBP-bound genomic regions contain a large number of motifs built up from a strong and a weak C/EBP half-site, but this motif degeneracy is compensated by shape features created by a directly upstream flanking nucleotide of the half-sites, which can be any nucleotide other than T. This nucleotide preference is characteristic of other bZIP motifs, such as TRE/CRE half-sites, as well [31]. The binding of weak motifs can also be supported by the neighboring elements, as in the case of the promoter-proximal enhancer of interferon-β (IFNB1) [50]. In the TRE of this enhancer, ATF2 has extended interactions within the major groove (5′-TGA(C)-3′), while JUN has a sub-optimal – essentially unrecognizable and barely bound – half-site (5′-TAT(G)-3′), which, in contrast, is optimal for minor groove binding by the neighboring IRF3.
Interestingly, the downstream interferon-stimulated response element (ISRE) is also unusual, but this does not negatively affect the DNA-protein interactions and the formation of the so-called enhanceosome [50]. Interaction between AP-1 and IRF proteins is not restricted to this single site. Several AP-1:IRF composite elements (AICEs) could be identified in white blood cells, in which JUNB/BATF and IRF4/8 showed interaction at important immune response genes [51].
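
The difference between TRE and CRE, the same convergent half-sites with a one- or two-nucleotide spacer, can be made concrete with a minimal scan. This sketch uses only the two consensus sequences quoted in this section; the dictionary, function name, and example sequence are hypothetical, and exact matching will by design miss weak or degenerate half-sites such as those of the IFNB1 enhancer.

```python
import re

# Consensus sequences as quoted in the text.
BZIP_MOTIFS = {
    "TRE": "TGA[CG]TCA",  # 5'-TGA(C/G)TCA-3', one-nucleotide spacer
    "CRE": "TGACGTCA",    # 5'-TGACGTCA-3', two spacer nucleotides
}


def scan_bzip(seq: str) -> list[tuple[int, str, str]]:
    """Return sorted (0-based start, motif name, matched sequence) triples.

    Both consensus motifs are reverse-complement symmetric as written
    (the reverse complement of TGACTCA is TGAGTCA), so scanning a single
    strand is sufficient for these two patterns.
    """
    hits = []
    for name, pattern in BZIP_MOTIFS.items():
        for m in re.finditer(f"(?=({pattern}))", seq.upper()):
            hits.append((m.start(), name, m.group(1)))
    return sorted(hits)


# Made-up sequence with one TRE followed by one CRE.
example = "GGTGACTCAGGTGACGTCAGG"
print(scan_bzip(example))
```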

Motif grammar of nuclear receptors

Specific and selective DNA binding by NRs is determined by most aspects of motif grammar beyond the orientation and spacing, which are self-evident features for most dimer binding sites. Basically, there are two types of NR half-sites. Most NRs bind variants of the 5′-(A/G)GGTCA-3′ consensus half-site, while members of the NR3C steroid hormone receptor family have a 5′-AGAACA-3′ consensus sequence (Fig. 3) [2]. All steroid hormone receptor dimers bind IR elements separated by 3 nucleotides (IR3), and the additional NR dimers recognize a complete series of DR elements with zero to at least five nucleotide long spacers (DR0-5). Retinoic acid receptor (RAR, NR1B)/RXR heterodimers, for instance, bind a wide range of elements, namely DR0, DR1, DR2, DR5, DR8 (DR2:DR0), and also IR0 elements [52], [53]. Since most DR elements can be bound by more than one dimer, in their case, specificity should be made possible in other ways. Beyond their cell type-specific gene expression and dimer conformation that allow the recognition of different spacer lengths, NRs also adapted to additional sequence features, primarily the quality of the half-sites and their flanking sequences (Fig. 3).
Certain NRs recognize a half-site longer than the consensus hexamer. All these sequences are extended upstream and some of them can be part of DR elements and thus serve specific motif recognition by NR dimers. Peroxisome proliferator (PP)-activated receptors (PPARs, NR1C), REV-ERB proteins (NR1D), and RAR-related orphan receptors (RORs, NR1F) belong to different NR classes based on their dimerization and DNA binding features, but all of these prefer 5′-(A)A(C/G)T(A/G)GGTCA-3′ sequences over shorter half-sites (Fig. 3). These NRs have a carboxy-terminal extension (CTE) of their DBD, and this interacts with the minor groove upstream of the DBD-bound half-site [8], [54]. Thereby PPAR/RXR heterodimers bind extended DR1 elements (PPREs) with higher affinity than other DR1s, and the other DR1-binding NRs, such as hepatocyte nuclear factor 4 (HNF4, NR2A) and testicular orphan receptor (TR, NR2C) dimers, might prefer DR1 elements with other features [2], [55]. Similarly, REV-ERB dimers prefer DR2 elements with 5′-A(C/G)T-3′ extensions both upstream of the DR2 and within the spacer (Rev-DR2s) [54]. As a result, the monomeric RORs are capable of PPRE and Rev-DR2 binding, which both include the ROR response element (RORE), and REV-ERBs are capable of repressing both on single ROREs and PPREs, which has functional consequences in the regulation of the circadian rhythm of the cells [56]. Besides these related NR families, there are additional ones with sequence preference within the minor groove. These all form heterodimers with RXR, bind the downstream half-site, and thus their specific 5′ extensions are in the spacer of DR elements (Fig. 3). Interestingly, liver X receptors (LXRs, NR1H2-3) have an extension similar to that of ROREs despite the significant structural differences.
The DR4 of RXR/LXR (LXRE) contains a spacer with a preferred 5′-CTNN-3′ sequence, which is bound by the amino-terminal extension (NTE) of LXR [57], [58]. In contrast, in the case of the vitamin D receptor (VDR, NR1I1) and thyroid hormone receptors (THRs, NR1A), the CTE of the DBD has an alpha-helical structure that crosses the minor groove in the spacer of DR3 or DR4, respectively [59], [60]. The CTE helix of THR interacts with more phosphate groups of both strands and the first two nucleotides of the 3′ half-site in the minor groove, yet it has a preferred nucleotide within the spacer, whose general sequence is 5′-NN(T/C)N-3′ [61]. VDR has an alternative, 7-nucleotide long 3′ half-site with a 5′-G(A/G)G(T/G)TCA-3′ consensus sequence, the beginning of which shows interactions with the CTE helix similar to those observed in the case of THR [60], [62]. The last class of NRs is the monomeric orphan receptors, which theoretically have a single half-site to bind, but since the variants of this consensus hexamer are very frequent and thus cannot provide specificity, these NRs – including RORs – also have 5′ extended monomer binding sites. As a result, NR4A proteins require a 5′-AA-3′ extension, and NR3B and NR5A proteins require a 5′-CA-3′ extension beyond the NR “half-site” (Fig. 3) [63], [64], [65]. This means that except for NR0B proteins, which have no complete DBD allowing DNA binding, only the NR2E family members have no extended half-site described, although this is one of the least examined NR families. Not only the monomeric, but also the dimeric NRs show motif degeneracy, although typically only one half-site is affected, and one strong half-site is required for stable DNA binding, like in the case of the C/EBP dimer binding sites. In the so-called half-site binding mode, any half-site can provide strong DNA-protein interaction, and this contributes to the recruitment of the dimerizing partner with the weak half-site [66].
In the case of PPREs, both the PPAR and RXR half-site can be extended, and these provide more frequent binding than the weak and non-extended ones [67]. Importantly, half-site binding mode is dominant over the full site mode in the PPARγ cistrome of macrophages and adipocytes, which can be part of the optimization of enhancers to fulfill their gene regulatory roles but involves major technical limitations in the determination of sequence-specific direct binding events.
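
The DR0-5 series described in this section can be enumerated with a simple search for two tandem copies of the general half-site separated by a variable spacer. This is a minimal sketch under stated assumptions: both half-sites must match the 5′-(A/G)GGTCA-3′ consensus exactly, only the given strand is scanned, and the 5′ extensions and spacer preferences discussed above are ignored. The function name and example sequence are hypothetical. Because the half-site binding mode tolerates one weak half-site, a strict scan like this would miss many sites that are occupied in cells.

```python
import re

# General NR half-site from the text: 5'-(A/G)GGTCA-3'
HALF_SITE = "[AG]GGTCA"


def find_direct_repeats(seq: str, max_spacer: int = 5) -> list[tuple[int, str, str]]:
    """Return sorted (0-based start, element name, matched sequence) triples.

    For each spacer length n from 0 to max_spacer, look for two head-to-tail
    consensus half-sites separated by exactly n nucleotides (DRn).
    Assumption: exact consensus half-sites only; weak half-sites are not found.
    """
    hits = []
    for n in range(max_spacer + 1):
        pattern = re.compile(f"(?=({HALF_SITE}[ACGT]{{{n}}}{HALF_SITE}))")
        for m in pattern.finditer(seq.upper()):
            hits.append((m.start(), f"DR{n}", m.group(1)))
    return sorted(hits)


# Made-up sequence containing a single DR1 element.
example = "CCAGGTCAAAGGTCACC"
print(find_direct_repeats(example))
```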

Summary and outlook

Naturally, our understanding of DNA-TF interactions is limited by the methodologies used. Initially, a few model or canonical sites were used for molecular biology and biochemical studies. These restricted and biased the construction of the motif grammar. In the last two decades, several ex vivo high-throughput methods, such as protein-binding microarrays, systematic evolution of ligands by exponential enrichment (SELEX)-based methods, and DNA affinity purification sequencing (DAP-seq) were developed to characterize the sequence features determining DNA binding by TFs. These showed not only the sequences optimal for DNA-TF interactions, but also their DNA methylation dependence, a wide range of possible TF-TF interactions through composite elements, as well as the significance of half-site binding mode [7], [15], [27], [66]. However, because these methods cannot take live cellular conditions – such as the neighboring or farther interacting sequences and the concentration of collaborating or competing TFs and other chromatin components – into consideration, several features, for instance those allowing the binding of suboptimal sites, could not be broadly tested. A successful way to investigate this phenomenon is to compare organisms with different genotypes or generate mutants and show the alterations in DNA binding, although these are not high-throughput approaches [22], [23], [24], [25]. In contrast, the availability of cistromic (chromatin immunoprecipitation sequencing, ChIP-seq) data sets allows an unbiased assessment. However, discriminating direct from indirect DNA binding events remains a great challenge, because bioinformatic approaches typically cannot identify weak binding sites, even though these can be as functional as canonical elements in certain conditions. Identifying all functional units of regulatory regions requires more sophisticated, future approaches.

CRediT authorship contribution statement

Gergely Nagy: Conceptualization, Writing - original draft, Visualization. Laszlo Nagy: Writing - review & editing, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

67 in total

1. Retinoic acid receptors recognize the mouse genome through binding elements with diverse spacing and topology.

Authors: Emmanuel Moutier; Tao Ye; Mohamed-Amin Choukrallah; Sylvia Urban; Judit Osz; Amandine Chatagnon; Laurence Delacroix; Diana Langer; Natacha Rochel; Dino Moras; Gerard Benoit; Irwin Davidson
Journal: J Biol Chem Date: 2012-06-01 Impact factor: 5.157

2. The leucine zipper: a hypothetical structure common to a new class of DNA binding proteins.

Authors: W H Landschulz; P F Johnson; S L McKnight
Journal: Science Date: 1988-06-24 Impact factor: 47.728

3. Identification of target genes and a unique cis element regulated by IRF-8 in developing macrophages.

Authors: Tomohiko Tamura; Pratima Thotakura; Tetsuya S Tanaka; Minoru S H Ko; Keiko Ozato
Journal: Blood Date: 2005-06-09 Impact factor: 22.113

4. Crystal structure of a human TATA box-binding protein/TATA element complex.

Authors: D B Nikolov; H Chen; E D Halay; A Hoffman; R G Roeder; S K Burley
Journal: Proc Natl Acad Sci U S A Date: 1996-05-14 Impact factor: 11.205

5. Steroidogenic factor I, a key regulator of steroidogenic enzyme expression, is the mouse homolog of fushi tarazu-factor I.

Authors: D S Lala; D A Rice; K L Parker
Journal: Mol Endocrinol Date: 1992-08

6. Structural basis of alternative DNA recognition by Maf transcription factors.

Authors: Hirofumi Kurokawa; Hozumi Motohashi; Shinji Sueno; Momoko Kimura; Hiroaki Takagawa; Yousuke Kanno; Masayuki Yamamoto; Toshiyuki Tanaka
Journal: Mol Cell Biol Date: 2009-09-21 Impact factor: 4.272

7. A De Novo Shape Motif Discovery Algorithm Reveals Preferences of Transcription Factors for DNA Shape Beyond Sequence Motifs.

Authors: Md Abul Hassan Samee; Benoit G Bruneau; Katherine S Pollard
Journal: Cell Syst Date: 2019-01-16 Impact factor: 10.304

8. Meis1, a PBX1-related homeobox gene involved in myeloid leukemia in BXH-2 mice.

Authors: J J Moskow; F Bullrich; K Huebner; I O Daar; A M Buchberg
Journal: Mol Cell Biol Date: 1995-10 Impact factor: 4.272

Review 9. Protein-DNA binding: complexities and multi-protein codes.

Authors: Trevor Siggers; Raluca Gordân
Journal: Nucleic Acids Res Date: 2013-11-16 Impact factor: 16.971

10. A single ChIP-seq dataset is sufficient for comprehensive analysis of motifs co-occurrence with MCOT package.

Authors: Victor Levitsky; Elena Zemlyanskaya; Dmitry Oshchepkov; Olga Podkolodnaya; Elena Ignatieva; Ivo Grosse; Victoria Mironova; Tatyana Merkulova
Journal: Nucleic Acids Res Date: 2019-12-02 Impact factor: 16.971

4 in total

1. A growth factor-expressing macrophage subpopulation orchestrates regenerative inflammation via GDF-15.

Authors: Andreas Patsalos; Laszlo Halasz; Miguel A Medina-Serpas; Wilhelm K Berger; Bence Daniel; Petros Tzerpos; Máté Kiss; Gergely Nagy; Cornelius Fischer; Zoltan Simandi; Tamas Varga; Laszlo Nagy
Journal: J Exp Med Date: 2021-11-30 Impact factor: 14.307

2. Asymmetric Conservation within Pairs of Co-Occurred Motifs Mediates Weak Direct Binding of Transcription Factors in ChIP-Seq Data.

Authors: Victor Levitsky; Dmitry Oshchepkov; Elena Zemlyanskaya; Tatyana Merkulova
Journal: Int J Mol Sci Date: 2020-08-21 Impact factor: 5.923

3. Integrative analysis reveals multiple modes of LXR transcriptional regulation in liver.

Authors: Lara Bideyan; Wenxin Fan; Karolina Elżbieta Kaczor-Urbanowicz; Christina Priest; David Casero; Peter Tontonoz
Journal: Proc Natl Acad Sci U S A Date: 2022-02-15 Impact factor: 11.205

Review 4. Learning the Regulatory Code of Gene Expression.

Authors: Jan Zrimec; Filip Buric; Mariia Kokina; Victor Garcia; Aleksej Zelezniak
Journal: Front Mol Biosci Date: 2021-06-10
