Literature DB >> 21486752

Finding subtypes of transcription factor motif pairs with distinct regulatory roles.

Abha Singh Bais¹, Naftali Kaminski, Panayiotis V Benos.

Abstract

DNA sequences bound by a transcription factor (TF) are presumed to contain sequence elements that reflect its DNA binding preferences and its downstream-regulatory effects. Experimentally identified TF binding sites (TFBSs) are usually similar enough to be summarized by a 'consensus' motif, representative of the TF DNA binding specificity. Studies have shown that groups of nucleotide TFBS variants (subtypes) can contribute to distinct modes of downstream regulation by the TF via differential recruitment of cofactors. A TF(A) may bind to TFBS subtypes a(1) or a(2) depending on whether it associates with cofactors TF(B) or TF(C), respectively. While some approaches can discover motif pairs (dyads), none address the problem of identifying 'variants' of dyads. TFs are key components of multiple regulatory pathways targeting different sets of genes perhaps with different binding preferences. Identifying the discriminating TF-DNA associations that lead to the differential downstream regulation is thus essential. We present DiSCo (Discovery of Subtypes and Cofactors), a novel approach for identifying variants of dyad motifs (and their respective target sequence sets) that are instrumental for differential downstream regulation. Using both simulated and experimental datasets, we demonstrate how current motif discovery can be successfully leveraged to address this question.

Entities: Chemical Disease Gene Species

Mesh：

Substances：

Year: 2011 PMID： 21486752 PMCID： PMC3113591 DOI： 10.1093/nar/gkr205

Source DB: PubMed Journal: Nucleic Acids Res ISSN： 0305-1048 Impact factor: 16.971

INTRODUCTION

Transcription factors (TFs) are DNA binding proteins that recognize and bind a small set of similar DNA binding sites with high specificity to regulate the expression of multiple target genes. The binding sites of a TF can be determined using in vitro or in vivo techniques. The former include methods like SELEX and its variants (1,2) and protein binding microarrays (PBMs) (3), whereas the latter routinely involve chromatin immunoprecipitation (ChIP) coupled with either microarrays (ChIP-chip) or deep sequencing (ChIP-seq) [for reviews see Ref. (4,5)]. In most cases, experimentally found binding sites of a TF are similar enough to be summed up by a ‘consensus’ motif or a ‘position weight matrix’ [for an excellent review, see Ref. (6)]. In some cases, however, a TF may show different binding preferences in vitro and in vivo (7,8). Furthermore, even in vivo, a TF may bind variants of its main (or ‘canonical’) motif to target distinct downstream genes [(9–11); reviewed in (7)]. Such variant sites are referred to as ‘non-canonical’ sites and may contribute to distinct modes of downstream regulation (12–16). The precise nucleotide sequence of a TFBS plays an important role, not only in attracting the corresponding TF, but also in the recruitment of its cofactors and hence in the mode of regulation of its targets. A TF A may bind to TFBSs of subtype a1 or a2 and target different genes depending on whether it associates with cofactor B or C, respectively (Figure 1). For example, in Escherichia coli, the cyclic AMP receptor protein (CRP) binds a 22-bp consensus motif CRP-N to regulate roughly 100 genes involved in the response to sugar starvation in the cell (17). In Haemophilus influenzae, CRP recognizes the typical CRP-N sites, but it also recognizes and binds to a CRP-N variant, the non-canonical CRP-S motif (18,19). The CRP-S sites are found in the promoters of genes involved in ‘competence’, a process by which cells can take up DNA from the environment, and which require the presence of both CRP and another protein, Sxy, for transcriptional activation. A similar functional CRP-S regulon has also been identified in E. coli (13,20), suggesting possibly a similar mechanism. Hence, the protein CRP recognizes two highly similar but distinct motifs, subtypes CRP-N and CRP-S, in the presence or absence of Sxy thus regulating different sets of downstream genes. Single nucleotide changes too have been associated with the choice of cofactors and the set of target genes as in the case of the glucocorticoid receptor (16,21), NF-κB (22), Pit-1 (23), Foxa2 (24), etc. In the following, we use the term ‘subtypes’ of TFBSs to refer to groups of highly similar but distinct nucleotide variants of the canonical binding motif of a TF.

Figure 1.

Dyad-dependent modes of regulation. A TF A with DNA binding specificity a may bind subtypes, a1 and a2, of binding sites depending on whether they co-occur with binding sites b and c of cofactors B and C, respectively. Alternatively, a TF that binds as a dimer may bind dyads a1:b or a2:c depending on the sets of downstream targets and/or vice versa. Finally, two distinct TFs with highly similar but distinct binding sites a1 and a2, may recognize and bind their respective TFBSs by associating with corresponding cofactors B and C, respectively. Logos represent the consensus motifs of the TFs and d is the maximum inter-site distance. There is a plethora of computational TFBS motif discovery methods [reviewed in Ref. (25–27)]. Typically these methods ignore differences between canonical and non-canonical motifs. Almost all tend to average slight variations of the enriched motifs into ‘consensi’, thereby not identifying TFBS subtypes. An initial attempt to discover subtypes of single TFBS motifs was a kernel estimations-based method presented by Kel et al. (28). The authors tested their method on similar TFBSs of two distinct TFs, AP-1 and CREB. Hannenhalli et al. (29) and Georgi et al. (30) used mixture-model representations of existing position-specific probabilistic models for TFBSs to account for subtypes of TFBSs. Computational studies analyzing ChIP-chip/seq data have also uncovered non-canonical variants of the main TFBS motifs using information about distance from the TSS or presence or absence of cofactor TFBSs (9,12,21,24). In some cases, the discovered subtypes of TFBSs had similar sequence consensus overall but varying strengths of the sites (24). In other cases, the two motifs were drastically different in sequence composition as well as in length (9). As useful as these approaches are, they do not systematically address the fact that TF binding preferences may change depending on different cofactor associations. Few motif discovery approaches focus on dyad motif discovery. A dyad motif consists of two motifs separated by a certain number of bases. The predictions from dyad-based methods can be used further for building cis-regulatory modules comprised of more than two components. Guhathakurta et al. (31) and Liu et al. (32) proposed Gibbs-sampling-based approaches to identify dyads, defined as cooperatively acting TFs, two-block motifs or gapped motifs. The method BIPAD (33) predicts a dyad motif pattern, or a ‘two-block structured motif’ with a given maximum inter-site distance by optimizing over the total information content (IC) of the two half-sites. Another approach GADEM (34) uses a genetic algorithm followed by expectation-maximization to discover ‘spaced dyads’, defined therein as two words that are separated by a spacer. While most of these dyad-discovery methods are PWM based, the approaches of van Helden et al. (35) and Eskin et al. (36) are consensus sequence based. The former uses word counting followed by estimation of statistical significance to discover spaced dyad elements, defined as a pair of short conserved words separated by a region of fixed size and variable content. The latter uses a single motif discovery step to identify putative instances of composite signals followed by an exhaustive search. Here, composite patterns are referred to as groups of ‘monads’, or single motifs, that occur near each other, with dyads being the special case of a pair of monads that appear approximately within a given distance. In another work, Smith et al. (37) use a set of candidate motifs that best predict localization data and search the sequence neighborhood for putative cofactor motifs. None of the aforementioned approaches addresses the question of identifying variants of the same motif of a TF A, depending on whether it binds to DNA in the presence of cofactors B or C (Figure 1). In the case of single motif discovery, an average motif may be calculated for A; whereas in the case of dyad motif discovery, a method may fail to detect any dyad motif if the TFBSs of B and C are substantially different. Since TFs usually participate in multiple regulatory pathways, sometimes targeting different sets of genes with perhaps different binding preferences, it is essential to identify the discriminating TF–DNA associations that lead to different downstream target genes. We present DiSCo (Discovery of Subtypes and Cofactors), an integrated approach to identify variants of TFBS dyad motifs and the subsets of sequences that reflect the different modes of regulation adopted by the TF. We refer to motifs of same length that differ marginally as different subtypes of the main TFBS. While this definition precludes examples where motifs of different lengths have been discovered as being bound by the same TF, it enables us to distinguish TFBS subtypes based on variation purely in sequence composition and not length differences. For simplicity, we do not take into account position dependencies within a motif. As mentioned earlier, we refer to pairs of binding sites that occur within a maximum distance, say d, as dyads. Slight variations from the consensus in either one or both components of the dyad are termed as dyad subtypes. Unless otherwise stated, the order or the orientation of the motifs inside a dyad is not considered. Our assumption is that a TF may adopt different binding preferences depending on the constitution of enriched dyads in a set of target sequences (Figure 1), which could lead to different modes of downstream regulation. Although it is possible that a TF adopts more than two distinct modes of DNA binding and regulation along with associated cofactors, to the best of our knowledge, no such examples have been observed yet. Note that while our main goal is to study subtype-co-factor dependent modes of TF binding, TFs that bind as dimers with varying dimer composition also fall into the same technical framework. Finally, DiSCo can be used to differentiate between highly similar DNA motifs of two different TFs. To highlight the applicability of our approach, we utilize an existing PWM-based tool, BioProspector (32), which performs both single and dyad motif discovery, and is available for download and use. However, the approach we present here is general and allows the use of other dyad-discovery tools. We test DiSCo on simulated and experimental data, and demonstrate how existing motif discovery algorithms can be leveraged to yield subtypes of pairs of TFBS motifs as well as the sequence subsets that are enriched with them. The DiSCo software consists of Perl scripts and R code for clustering of TFBSs found using the freely available tool BioProspector, and is available from the authors on request.

MATERIALS AND METHODS

Let A be a TF with a DNA binding speciﬁcity represented by the PWM P. Let P1 and P2 be the PWMs representing two subtypes of the consensus binding site of A that are associated with motifs P and P, respectively, within a maximum distance d. Let S = S1 ∪ S2: be the set of sequences bound by A and containing instances of motifs P1 and P2, respectively. In this setting, a standard dyad discovery (SDD) method may yield motif pairs that cluster together instances of both kinds of subtype-cofactor pairs and hence have poorer discriminating performance. We would like to have a method that better characterizes the probabilistic models P1, P2, P and P, and partitions S into subsets S1: and S2:.

Approach

We propose a novel approach for dyad motif subtype discovery, based on dyad enrichment that distinguishes target sequence sets. The general idea is as follows (Supplementary Figure S1). Sequences are partitioned into two clusters based on initial predictions of a dyad motif discovery algorithm. Each of the two clusters is then subject to a second round of dyad motif discovery. Focusing on subsets of sequences with highly similar binding sites increases the signal-to-noise ratio, which in turn enables higher quality prediction of binding site subtypes. At the same time, the dyad motif discovery yields TFBS subtype-cofactor relations and—if the query dataset is a set of promoters—the respective downstream targets. Let S1,…, S be the N sequences containing the targets of TF A. Let W and w be the widths of the two components of the dyad which are enriched within a maximum inter-site distance of d.

Step 1: de novo dyad discovery

In this article, motifs are discovered using BioProspector (32), although, as mentioned earlier, any appropriate dyad-discovery tool can be used. BioProspector is a Gibbs-sampling-based method adapted for efficient gapped motif discovery and consideration of higher order background models. In the predictive-update step, it initializes two motif models by randomly choosing two positions, on the same strand in each sequence Si such that The models are constructed by using the substrings of lengths W and w, respectively, starting at In the sampling step, the randomly chosen pair in each sequence is scored as where are the respective probabilities of generating under the two corresponding motif models and is the background distribution. For every sequence, new positions for are sampled with a probability proportional to and the corresponding substrings are used to formulate two new motif models. The position for a substring x1 for the first motif is sampled using its marginal distribution where the sum is over all substrings of width w within the gap range d downstream from x1. Then the position for segment is chosen with probability conditioned on . In the default dyad-discovery mode, BioProspector searches both strands of the input sequences for motif pairs with widths W and w, respectively, within a gap range [gmin, gmax] (d = gmax − gmin) such that not all sequences need contain a copy of the motif pair. After multiple initializations (default = 40) to avoid local optima, it outputs the top five scoring motif pairs with the corresponding sites, locations, strands and scores in the relevant sequences. The scores correspond to the average IC of the two halves of a dyad. Hence for a motif pair, there can be multiple instances of site pairs predicted in a sequence. If no background model file is given, the input sequences are used to calculate the background. For our approach the following settings of BioProspector are used: (i) each sequence is searched for at least one occurrence of the motif pair with a maximum gap of gmax = d, hence gmin = 0, and (ii) only the locations of the highest scoring site pair in every sequence for the top ranking motif pair are stored in each search. This implies that for every sequence S, Step 1 yields position pairs corresponding to the best scoring site pair under the top ranking motif pair. For the purposes of the present study, the results of Step 1 correspond to those of a SDD method.

Step 2: Site-based clustering

Let be the substrings of lengths W and w, respectively, starting at positions in S. Our aim is to cluster together sequences that have highly similar binding sites of one or both components in the dyad. Hence, for all sequences with a predicted site pair, we calculate the pairwise proximity matrix D(i, j) by comparing the respective highest scoring site pairs The comparisons can be between the sites of the first motif (i.e. with respect to the first motif, parameter wrt = 1), sites of the second motif (wrt = 2) or sites of both motifs (wrt = 3). Unless otherwise stated, we calculate the Hamming distance (HD) between the sites of both motifs (wrt = 3) in a pair of sequences S, S, i.e. We note that other distance metrics can be used too. Once the pairwise proximity matrix is formulated, we use the iterative clustering method of Dubnov et al. (38) for clustering. This is a non-parametric approach predicted to converge to two broad clusters in most cases via a transformation on the pairwise proximity matrix followed by hierarchical clustering. The pairwise proximity matrix D(i, j) constructed from the HD is iteratively transformed according to the Jensen–Shannon divergence (39) and two broad clusters extracted.

Step 3: Motif discovery on clusters

In a sequence, the location of the site corresponding to the ﬁrst motif in the best scoring site pair is used as seed for the next round of motif discovery in the respective cluster the sequence belongs to. This means that in Gibbs sampling the ﬁrst motif is initialized at the locations of the corresponding sites predicted in the previous round. Motif discovery is performed on both clusters yielding four newly discovered PWMs, two from each cluster. The results of this motif discovery after clustering are referred to as those of our algorithm DiSCo, i.e. Steps 1–3 constitute a single run of DiSCo. While the algorithm can be iterated until convergence, we observed that a single iteration has sufficient discriminating power and there is no practical improvement in performance after the first iteration (data not shown). Hence, in the following tests, we present results where we only run DiSCo (Steps 1–3) once.

Post-processing of stochastic dyad motif finder results

Our approach uses the initial predictions of a dyad motif-finding program like BioProspector (32). In general, since most dyad-discovery methods are stochastic, their outputs may vary in each run. Thus, for the analysis of biological examples, our algorithm includes a post-processing step, in which the pipeline runs multiple times and the final output is decided by majority polling. Consider the motif-discovery step of either before or after clustering. On running motif discovery ‘Nrun’ times, for each sequence, the pair of sites predicted most often across all runs is considered a ‘robust’ prediction. Similarly, the run in which the output PWMs are composed of robust pairs of sites for the majority of sequences is likely to be the one with the most robust output. That is, the PWMs are formulated from sites that are predicted most often across ‘Nrun’ runs. For both SDD and DiSCo, we select the run that yields the most robust site pairs over all runs for the majority of sequences as the final output. Hence, both SDD and DiSCo end up predicting only one dyad occurrence per sequence. Suppose that after SDD, the highest scoring site pair in sequence S in the j-th run is located at Hence, over ‘Nrun’ runs, we have the locations of the highest scoring site pair for sequence Let be the robust prediction for sequence S. That is, it is predicted most often to be the location of the highest scoring site pair in Nrun runs, say it is predicted in runs. Hence, for each sequence S, we have R as the set of runs with the robust location as the final prediction. We choose the output of the run that predicts robust locations for all sequences. Clearly, it is possible that no single run predicts the robust locations of all sequences, in which case we pick a run that predicts it for the maximum number of sequences. In case all Nrun runs yield different predictions, we choose the prediction of a randomly selected run as If more than one location pair is predicted often and same number of times, we randomly choose one to be the robust location and so on. The same procedure is followed for DiSCo. The dyad(s) discovered after the majority polled run is taken as the final output of the method (SDD or DiSCo). To provide an estimate of the significance of the motif found, a P-value for each site of each of the motifs constituting the final dyad(s) is also calculated. The P-value is calculated by simulating the background motif score distribution by sampling N = 100 000 instances from the input nucleotide distribution. The P-value of a site corresponds to the probability of such a random instance having a motif log-odds score at least as high as that of the site. Additionally, Monte Carlo simulations can be performed by running DiSCo with the same parameters on randomly generated datasets with the same sequence number, length and nucleotide distribution as the original data and comparing the mean motif scores of the final dyads found.

Evaluation datasets

Simulated data

First, we compare the performance of DiSCo (i.e. results after clustering in Step 3) versus SDD (i.e. results after Step 1) in a controlled setting. Simulated datasets are constructed by implanting binding sites sampled from artificial PWMs (10 bp long), of varying IC into 20 (200-bp long) random sequences sampled from the uniform background distribution. The artificial PWMs were generated from randomly selected matrix columns in the TRANSFAC (40) database (version 11.2). The TRANSFAC matrix columns were categorized as being of low (L; 1 ≤ IC ≤ 1.5) or high (H; 1.5 ≤ IC ≤ 2) IC. To create a PWM of length W of high (or low) IC, we randomly choose W number of high (or low) IC columns. For generating the subtype a2 of motif a1, a proportion δ (δ = 0%, 30% or 40%) of the columns of a1 are randomly selected and the consensus nucleotide (A, C, G or T) is permutated. In each sequence, a site sampled from one matrix (say a1 or a2) is implanted at a random location followed by one sampled from another matrix (either b or c) within a randomly selected distance d' ≤ d = 20. For each dataset, two kinds of sequences with different site pairs are constructed and the known locations of the implanted sites stored. One complete dataset S is made up of two subsets of sequences: S1 with sites sampled from motif models of subtype a1 and motif b (a1:b), and S2 with sites from subtype a2 and motif c (a2:c). Without loss of generality, sites are implanted in the forward strand only. Since, for each sequence, only the best scoring site pairs are reported, we calculate the true positive (TP) and false positive (FP) sites by comparing the predictions after SDD and DiSCo with the known locations. Steps 1–3 are run multiple times (Nrun = 20) to calculate average sensitivity. For de novo dyad discovery, both in SDD at Step 1 and DiSCo in Step 3, we search for pairs of motifs of widths W = w = 10 with a maximum gap of d = 20. Only the forward strand is searched for at least one instance of the motif pair. BioProspector is run for the default number of initializations and the sites of both motifs are used for calculating the HD at the clustering step (i.e. wrt = 3).

CRP dataset

The E. coli CRP protein is a well-studied TF known to regulate hundreds of genes involved in carbon-energy starvation response amongst other functions [for a review see Ref. (41)]. In E. coli, CRP typically recognizes and binds a 22-bp symmetric motif (CRP-N) AAATGTGA(N6)TCACATTT (42). In H. influenzae, CRP binds not only the typical CRP-N sites, but also a variant CRP-S, which is enriched in the promoters of competence genes (18,19). This variant differs from CRP-N sites at one position in the core of each of the half-sites, TG(N6)TC in CRP-S instead of TG(N6)TC in CRP-N (Figure 3), and requires CRP and another protein, Sxy, for transcriptional activation (19,43). Recently, Sinha et al. (20) identiﬁed the equivalent CRP-S regulon in E. coli and demonstrated its requirement of both CRP as well as Sxy for transcriptional activation. The fact that regulation of the competence genes requires the CRP-S and not the CRP-N motif shows that differences in the binding motifs (and associated cofactors) can be directly associated with differential regulation of certain categories of genes.

Figure 3.

Motif logos of the CRP-N and CRP-S sites of H. influenzae and E. coli. Haemophilus influenzae CRP-N (A) and CRP-S (B) motifs created from sites retrieved from the Supplementary Tables of Cameron et al. (13); E. coli CRP-N (C) and CRP-S (D) motifs created from sites retrieved from (44) (CRP-N) and (19) (CRP-S). Logos generated using enoLOGOS (49).

We focus on 300 bases upstream of the CRP-N and CRP-S target genes in both E. coli and H. influenzae. Using the available CRP matrix (44), we retrieved the list of the E. coli CRP-N genes with experimentally found sites within 300-bp upstream of the TSS. For the E. coli CRP-S sequences, we used the 34 transcriptional units found to be differentially regulated by Sxy and considered putative CRP-S genes by Sinha et al. (20). For both sets, we extracted the corresponding upstream sequences from the E. coli K12 genome using RSATools (35), allowing for overlaps and not admitting imprecise positions. For completeness, we manually obtained sequences for genes whose identifiers did not yield matches. This resulted in a total of 25 CRP-N sequences and 37 CRP-S sequences in E. coli with 48 and 43 sites, respectively. The H. influenzae CRP-N and CRP-S genes were taken from the Supplementary Data of Cameron et al. (13). Using the complete genome sequence and coordinates information of the H. influenzae Rd KW 20 genome as downloaded from The Comprehensive Microbial Resource (45), we extracted the upstream regions for the respective transcriptional units for the two sets that contained 41 and 13 sequences respectively.

AP-1 and CREB dataset

We also considered two distinct TFs, AP-1 and CREB, that have binding sites that differ slightly at the 3′ end of the core regions (see below). A mixture of TFBSs retrieved from the TRANSFAC database (40) for these two TFs was used by Kel et al. (28) in a study aimed to distinguish between subtypes of single patterns. Their method successfully discovered two motifs of lengths 7 and 8 that were identical to the TRANSFAC motifs of AP-1 and CREB, respectively. Our focus is not just extracting subtypes of TFBSs but also the motifs from the surrounding regions associated with potential cofactors. Hence, for the present study, we retrieved 50 bp around the sites contributing to the matrices for AP-1 and CREB in TRANSFAC (40) (version 11.2) (IDS V$AP1_Q2_01 and V$CREB_Q4_01, respectively). Although there are numerous matrices for AP-1 and CREB in TRANSFAC, we choose to use these since they are formulated from experimentally derived sites from vertebrate species. We retrieved sequences from EBI using ‘dbFetch’ for the sites with EMBL IDs in the ‘site.dat’ ﬁle of the TRANSFAC database. The sites were then mapped onto the EBI sequences and ±50-bp regions extracted for analysis. We ran DiSCo for Nrun = 20 times and compared the resulting PWMs with the TRANSFAC database using STAMP (46), with no trimming for weak IC columns and default settings otherwise.

NF-κB dataset

Using a computational approach, Busse et al. (14) studied the NF-κB TFBSs in the regulatory regions of genes activated in the Toll and Imd pathways in Drosophila. They identiﬁed pathway-speciﬁc characteristics of the κB sites: nearly two-thirds of the Imd-speciﬁc promoter set had a GGGGA 5′ half-site. This site is absent in the Toll-speciﬁc set which instead had either GGGA or GGAA 5′ half-site in the upstream regions of the corresponding genes (14). We retrieved the 16 and 11 Toll- and Imd-specific sequences with a total of 17 and 21 sites, respectively, from the Supplementary Data of Busse et al. (14).

RESULTS

Evaluation of performance on simulated data

We study the problem of identifying dyad subtypes through simulated datasets that emulate sequences with binding site pairs that vary in signal strengths. The aim is to compare the performance of SDD (standard de novo dyad discovery before clustering) with DiSCo (dyad discovery after clustering), in a controlled setting. To this end, we generate sequence datasets with pairs of sites implanted at a maximum inter-site distance d. Sites are sampled from artificial matrices a1, b, a2 and c of varying IC. Each dataset, S, consists of two kinds of sequence subsets, S1 and S2, such that S1 has implanted sites sampled from motif models a1 and b and S2 has those sampled from a2 and c, respectively. Here, a1, a2 represent subtypes of a motif with varying amounts of dissimilarities between each other (see ‘Materials and Methods’ section). The predicted locations after SDD and DiSCo are compared with the implanted locations to calculate the TP and FP predictions. Since the number of known and predicted sites per sequence are equal, we use the average sensitivity (TP/TP + FN) over multiple runs for performance evaluation. The results are presented in Figure 2.

Figure 2.

DiSCo outperforms SDD on simulated datasets. The average sensitivity after Standard dyad discovery (SDD; white) and DiSCo (grey) as the proportion of dissimilarity between the subtypes a1 and a2 increases from δ = 0%, 30% and 40%, is shown. The bars are plotted for datasets with increasing mean total IC of the constituting matrices with the labels denoting the individual IC levels for the first (a1 or a2) and second motifs (b or c) used to construct a dataset. For example, ‘L, H’ denotes the datasets where motif models a1 and a2 are of low IC levels and b and c of high IC levels, respectively. L: low IC and H: high IC category, respectively. The error bars indicate standard errors. The proportion of dissimilarity is measured as the percentage of dissimilar positions between two matrices. In general, irrespective of the average total IC of the matrices used, the performance of de novo discovery both before and after clustering deteriorates as the similarity between a1 and a2 decreases (i.e. from δ = 0% to 40%, Figure 2). Clearly, the presence of sites that are sampled from the same or very similar motifs throughout the sequence set enables the SDD to yield more correct predictions, which also helps DiSCo. As the dissimilarity between the two subtypes increases, SDD deteriorates in performance yielding more false predictions. Since the output of SDD is used to seed the DiSCo motif discovery, incorrect predictions after SDD negatively affect DiSCo. However, the decrease in average sensitivity for DiSCo is less than that of SDD as the similarity between a1 and a2 decreases. For most matrix combinations, DiSCo yields a significantly higher average sensitivity than SDD, a direct consequence of the increased signal-to-noise ratio after clustering. In the case where both matrices are of low IC, both methods perform poorly, with no significant difference in their average sensitivity values. Again, at low IC, the predicted sites for both motifs after SDD are poor, which in turn influences the behavior of DiSCo. Since all motifs are poorly informative, putative predicted sites of a motif after SDD are inclined to be quite dissimilar to each other making them more unlikely to fall in the same cluster, leading to a detrimental effect on the performance of DiSCo. Finally, as expected, the increase in the average total IC of the constituting matrices improves the performance of both methods. The case of the (L, H) dataset is interesting. This is where the first and second motifs have low and high IC, respectively. A drop in the average sensitivity values is observed from the (L, H) to (H, L) dataset even though there is an increase in the total average IC. This is a consequence of using BioProspector, which is designed to find the strongest motif first and then search only downstream for the second motif. Hence, here BioProspector discovers the stronger motif (in this case, the second motif) first, which leads to increased false predictions for the weaker motif [located upstream in the (L, H) dataset] yielding a dip in sensitivity values. Again, although this affects negatively both SDD and DiSCo, DiSCo performs better (Figure 2). We performed similar simulations with search parameters changed to motif widths of W = w = 8 and got similar results. Additionally, we generated and analyzed simulated datasets with increased inter-site distance (d = 50 bp) and the results were practically the same (data not shown). In summary, as the difference between a1 and a2 increases, there is an improvement in DiSCo performance due to clustering.

Evaluation on the CRP dataset

CRP motifs provide an appropriate dataset for testing our algorithm for the following reasons. Although both CRP-N half-sites are highly similar between E. coli and H. influenzae (TGTGA), the CRP-S sites differ between the two species, especially in the second half-site. In particular, both species have similar first half-sites in the CRP-S motifs, except in the last base (Figure 3B and D); whereas the second half-site, that lies 6-bp away, appears to be species specific. Thus, we can treat the two half-sites as two separate motifs: one that is shared amongst all sequences and another that differs between the two subsets (in this case, the two subsets come from two different organisms). Focusing on a single species, the complete CRP datasets can also be used to test DiSCo. In H. influenzae the two CRP motifs vary in both half-sites (Figure 3A and B). Here, one type of dyad is enriched in CRP-S related promoters and the other in CRP-N related ones. In this case, both components of the dyads are speciﬁc to a (different) regulatory mode—competence or non-competence related. Lastly, in both species, the complete 22-bp CRP sites have variants CRP-S and CRP-N, depending on whether they are derived from promoters of competence or non-competence genes, respectively. In the former case, CRP binding requires the presence of the protein Sxy. Here, one TF (CRP) binds to subtypes of TFBSs, CRP-N and CRP-S, respectively, depending on the presence of cofactors. This provides an opportunity to explore novel motifs that might be speciﬁcally associated with the regulation of competence genes by CRP. Motif logos of the CRP-N and CRP-S sites of H. influenzae and E. coli. Haemophilus influenzae CRP-N (A) and CRP-S (B) motifs created from sites retrieved from the Supplementary Tables of Cameron et al. (13); E. coli CRP-N (C) and CRP-S (D) motifs created from sites retrieved from (44) (CRP-N) and (19) (CRP-S). Logos generated using enoLOGOS (49). The different aspects of the CRP dataset ﬁt well into our technical framework and provide an excellent test case scenario. We use the E. coli and H. influenzae CRP datasets to demonstrate the applicability of DiSCo in a biological setting through three main analyses. First, as a proof of principle we run DiSCo on the pooled set of CRP-S sequences from these two species to automatically partition them into two species-speciﬁc clusters and identify the dyads enriched in each. The two species-specific datasets have similar first half-sites but different second half-sites. Second, we run DiSCo on the set of pooled CRP-N and CRP-S sequences from H. influenzae in order to predict the two distinct dyad types and their corresponding targets. In this case, both first and second half-sites are different between the two datasets. Finally, we use DiSCo to study putative motifs that might be associated with each complete CRP dyad subtype in H. influenzae. In each case, site P-values are calculated as described in ‘Materials and Methods’ section. The maximum P-values for each motif output are presented in Supplementary Table S1.

Similar first half-sites, different second half-sites

We studied the pooled E. coli and H. influenzae CRP-S-associated promoter sequences. The motifs of the best scoring sites in the majority polled runs of SDD and DiSCo, when searching both strands for motif pairs of widths 5 each at a distance of 6 bp, are presented in Figure 4. Both methods discover the first half-site of the two species-specific TFBSs, though in one case the predicted motif is slightly shifted. For the second half-site, the motif discovered in SDD resembles more closely the second half-site of the E. coli CRP-S motif. Clearly, the greater number of E. coli CRP-S related sequences (37 E. coli versus 13 H. influenzae CRP-S) dominated the motif discovery in SDD, resulting in a motif that seems to average the individual dyads. In contrast, DiSCo identifies two sequence clusters with enriched dyads that correctly match the CRP-S motifs of H. influenzae (C1) and E. coli (C2). Since the first half-sites are very similar between the two species and present in almost all sequences, it is easy for SDD to correctly predict the locations of the first motif in the sequences. Using these predicted locations to initialize the motif discovery after clustering enables correct predictions for this motif in DiSCo. This ultimately leads to better identification of species-specific sequences as well as dyad subtypes. All H. influenzae CRP-S sequences correctly fall in the same cluster and majority of E. coli CRP-S sequences are clustered together, with only ∼24% of E. coli CRP-S sequences being misclassified (i.e. clustered with H. influenzae CRP-S). It should be stressed here that the pooled and weak second half-site discovered using the SDD is not necessarily the shortcoming of the underlying method used (in this case, BioProspector). Given the limitations of experimental knowledge, most motif discovery methods are inclined to do the same. However, as demonstrated, it is feasible to use the current state-of-the-art methods to further address questions of binding site subtypes.

Figure 4.

Motifs found on CRP-S sequences of E. coli and H. influenzae. The dyads discovered in the majority polled run after SDD (top row) and DiSCo (two bottom rows) are shown. SDD yields a dyad whose components predominantly resemble the two half-sites of the E. coli CRP-S motif. In contrast, DiSCo identifies clusters C1 and C2; C1 is enriched with a dyad similar to the H. influenzae CRP-S motif (Figure 3B) and C2 is enriched with one that is similar to the E. coli CRP-S motif (Figure 3D). Average misclassification error = 0.2045.

Different first and second half-sites

Next, we analyzed the pooled CRP-N and CRP-S related sequences in H. influenzae with resulting motifs shown in Figure 5 (both strands searched for motif pairs of widths 8 each at a distance of 6 bp). Here, besides the difference in the cores, the differences in the surrounding regions of the two half-sites in CRP-S and CRP-N are also prominent. The dyad motif discovered after SDD matches the CRP-N motif of H. influenzae, which might be due to the greater number of CRP-N related sequences (41) as compared to CRP-S related sequences (13). The two dyads DiSCo yields clearly resembling the two motifs. The CRP-S sequences again are clearly clustered together yielding a strong CRP-S like motif. Hence, despite the two half-sites differing in both sets, DiSCo successfully partitions the sequences into those enriched with the speciﬁc subtype of CRP hetero-dimers.

Figure 5.

Motifs found on CRP-N and CRP-S sequences of H. influenzae. The dyads discovered in majority polled run after SDD (top row) and DiSCo (bottom two rows) are shown. SDD yields a dyad whose components have clustered together the half-sites of both types of H. influenzae CRP motifs (Figure 3A and B). DiSCo however is able to successfully identify two clusters C1 and C2 whereby C1 is enriched with a dyad similar to the H. influenzae CRP-N motif (Figure 3A) and C2 with one that resembles the H. influenzae CRP-S motif (Figure 3B). Average misclassification error = 0.23.

Sequence properties of putative cofactors

To study if there exist additional sequence motifs that are needed for regulation of CRP-S sequences, we use DiSCo to analyze the target sequences of both CRP variants to search for dyads where one motif is of length 22 bp. The aim here is 2-fold. One, to study if DiSCo is able to identify the two CRP variants along with their corresponding sequence subsets in each species; and two, to investigate additional sequence signals that might co-occur with each CRP binding site variant, and might aid in the decision of the specific mode of regulation employed by CRP. Since Sxy itself lacks a DNA binding domain and no Sxy binding sites are known, we run DiSCo multiple times with varying parameter values, like motif widths for the second component of the dyad and multiple gap values. This is a typical procedure for many biological problems, when no additional information is known about the potential cofactors. For one of the components, we search for 22-bp long motif. In both species, irrespective of the width of the second component, DiSCo successfully identified the CRP motifs and their target sequence subsets (clusters). On searching for motif pairs of widths W = 8 and w = 22, within a maximum gap of 10 bp on both strands, we found both CRP-N- and CRP-S-like sites co-occurring with AT-rich motifs (data not shown). Previously, Cameron et al. (43) had observed A + T runs upstream of the CRP-S sites in H. influenzae that were required for promoter activation. From our analysis, such motifs seem to be present close to CRP-N sites also. On reversing the order of motif widths (W = 22 and w = 8 bp), again yields AT-rich motifs (data not shown). It seemed that there are only slight differences between the associated motifs of each CRP variant. However, for their motif analysis, Cameron et al. (43) aligned sequences that were 200-bp upstream of genes and found E. coli σ70-like sites downstream of the CRP-S sites. Following that study, we also restricted our search space to 200-bp regions. For this search, we used the same motif length parameters W = 22 and w = 8 bp, and a maximum gap of 20 bp. We searched both strands and the forward strand only. In both cases, we identified two clusters, each enriched with one type of CRP motif. In the latter case, though, for the cluster enriched with the CRP-S like motif, the second component matches the first half of the E. coli σ70 motif (Figure 6) while the cluster enriched with the CRP-N like motif had a second motif which is AT-rich, but dissimilar to the TTG stretch of E. coli σ70. Hence, while the CRP-S sites seem to have at least part of an E. coli σ70-like motif downstream, the CRP-N sites do not. In general, by using DiSCo to analyze this biological dataset thoroughly, we were able to identify the two CRP motif subtypes separately along with the distinguishing possible cofactor motif. This shows the direct applicability and usefulness of DiSCo in addressing biological problems.

Figure 6.

Motifs found on CRP-N and CRP-S sequences of H. influenzae with one motif of complete CRP length. The dyads discovered in majority polled run after SDD (top row) and DiSCo (two bottom rows) when the search is performed for the complete CRP motif as the main motif, are shown. SDD yields a pair of motifs, one of which has grouped together the complete CRP-N and CRP-S motifs of H. influenzae (Figure 3A and B), and the other is an AT-rich motif. In contrast, DiSCo successfully identifies two clusters C1 and C2 where C1 is enriched with the H. influenzae CRP-N-like motif (Figure 3A) and C2 with the H. influenzae CRP-S-like motif (Figure 3B). Additionally, the second motif discovered in C2 closely resembles the first half of the E. coli σ70 motif, found previously also by Cameron et al. (43). Average misclassification error = 0.24.

Evaluation on the AP-1 and CREB dataset

AP-1 and CREB are two TFs with similar but not identical binding sites, which mainly differ at the 3′ end of the core regions (Figure 7A). We tested whether DiSCo is able to identify the two types of binding sites, automatically partition their target sequence sets and in the process identify the surrounding sequence motifs associated with them (if any). To this end, we pooled together and analyzed the complete set of sequences containing both kinds of TFBSs. We searched both strands of all sequences for motif pairs of widths 7 bp each and a maximum gap of 5 bp using sites of both motifs to calculate the clustering measure in DiSCo (i.e. wrt = 3). The motifs found in the majority polled runs of SDD and DiSCo are presented in (Figure 7B, P-values are reported in Supplementary Table S1). SDD identified a dyad where one motif is similar to both AP-1 and CREB. In other words, the two kinds of TFBSs are pooled together (Figure 7B, top row). However, DiSCo segregates the two sequence sets into two clusters and identifies the individual TFBS motifs (Figure 7B, two bottom rows). The cluster of sequences enriched with the CREB-like motif, contains another co-occurring motif with high-ranking matches to CAC binding motif and Pax-4 in TRANSFAC. On comparing with JASPAR (47) (version 2010), this motif matches Krueppel-like factor 4 (KLF-4). The protein KLF-4 has been shown to be involved in the regulation of mouse B2R promoter by the formation of a higher-order complex with CREB and p53 in conjunction with the co-activator p300/CBP (CREB binding protein) (48). It is likely that CREB and KLF-4 are together involved in the regulation of other genes too, using possibly a similar mechanism. However, we could not find any evidence for the second motif we identified in the sequence set enriched with the AP-1–like motif. In summary, for this set of TFBSs, surrounding sequence lengths and search parameters, the TFBSs of CREB tend to co-occur with those of KLF-4.

Figure 7.

Analysis on the AP-1 and CREB dataset. (A) Logos of AP-1 (V$AP1_Q2_01) and CREB (V$CREB_Q4_01) matrices from TRANSFAC are shown. (B) Logos formulated from the best scoring sites of the dyads discovered in the majority polled run after SDD (top row) and after clustering (DiSCo; two bottom rows) on the complete set of AP-1 and CREB sequences are shown. SDD yields a dyad composed of a CREB-like motif (STAMP best match to ATF4, E-value ∼10e-09 and to CREB with E-value ∼9e-08) and a motif that matches DEAF1 (E-value ∼2e-04). The clusters resulting from DiSCo are enriched with AP-1 and CREB motifs, respectively. The cluster C1 [second row in (B)] is enriched with a dyad whose components match CREB and CAC-binding motif (E-values of ∼2e-11 and ∼10e-04, respectively). On comparing with JASPAR, the second motif best matches KLF-4 (E-value = 4.4e-03). The components of the dyad discovered in cluster C2 [third row in (B)] match AP-1 and Adf-1 (E-values of ∼6e-11 and 4e-03, respectively). Average misclassification error rate = 0.22.

Evaluation on the NF-κB dataset

NF-κB has two highly similar sites enriched in pathway-speciﬁc promoters (14) (Figure 8). However, there are a couple of points that need to be noted here. First, the spacer lengths in the two sites is different; four or five bases for Toll-speciﬁc sites, and two or three bases for the Imd-speciﬁc sites. Since we do not distinguish between variants based on intra-site spacer lengths, this information is not taken into consideration by DiSCo. Second, the Toll and Imd sets also differ in the number of sites they have per promoter. While most of the promoter sequences in the Toll dataset contain only single κB site instances, most of the promoters in the lmd dataset contain multiple κB sites. This implies that while the total number of Toll and Imd promoters was 16 and 11, respectively, the total number of sites was 17 and 21, respectively. Finally, in approximately half of the Imd sites the 3′ half-site diverged from the canonical site also, which makes it difficult to use dyad discovery to distinguish between the Toll and Imd sets, since differences already exist among the promoters in the Imd set. Keeping these issues into consideration, we used DiSCo to analyze the complete set of Toll- and Imd-speciﬁc sequences. We searched for motifs of widths 4 and 5, respectively, on both strands and we calculated the clustering measure based on the sites of the second motif only (i.e. wrt = 2). The logos of the best scoring sites in the dyads found in the majority polled run by SDD and DiSCo are shown in Figure 8 (P-values are reported in Supplementary Table S1). Since the two subtypes are highly similar, SDD predicts a dyad that summarizes the slight dissimilarities by clustering the two pathway-specific NF-κB motifs into one. On the other hand, DiSCo successfully separates the sequences into two clusters that are enriched with one kind of dyad each. Since here the dyad is comprised of the whole NF-κB motif in each set, the two pathway-specific motifs are hence identified.

Figure 8.

Analysis on NF-κB dataset. (A) Logos of Imd- and Toll-specific NF-κB sites. (B) Dyads discovered in the majority polled run after SDD (top row) and DiSCo (two bottom rows) for the complete set of Imd- and Toll-specific NF-κB sequences. SDD yields a dyad that has pooled the two pathway-specific subtypes of sites. The clusters resulting from the Step 3 of DiSCo are enriched with Toll specific [C1, second row in (B)] and Imd specific [C2, third row in (B)] κB sites, respectively. Average misclassification error rate = 0.28.

DISCUSSION

Advanced technologies that enable the collection of high-throughput in vivo and in vitro TF–DNA interaction data have shed new light on the changes in the DNA binding preferences of a TF in vivo. For eukaryotic organisms with complex mechanisms of gene regulation, the in vivo binding preferences of a TF frequently depend on the presence or absence of cofactors. Research has unearthed examples where subtypes of TFBSs contribute to the regulation of different pathway- or function-specific target genes. Typically, the standard approach for analyzing TF-bound sequences is to perform simple de novo single or dyad motif discovery. In this work, we presented DiSCo, a first attempt to computationally address the following question: given a set of unaligned DNA sequences bound by a TF A, identify putative subtypes a1 and a2 of TFBSs that depend on the presence of TFBSs for possible cofactors B and C, respectively. The main steps of DiSCo consist of dyad motif discovery on the complete sequence set, followed by clustering of sequences based on the similarity of the found dyads, and finally performing dyad motif discovery in each cluster separately. The clustering step helps in reducing the search space, making the second dyad motif discovery more efficient. Through analysis of both artificial and biological datasets, we demonstrated that the approach of DiSCo improves on TFBS detection as compared to standard dyad motif discovery. This was so, even when we considered the top two dyads of the standard motif discovery method or when we masked the occurrences of the first dyad and then searched for a second one (see Supplementary Data). In its current form, DiSCo uses BioProspector (32) for dyad discovery, although the underlying approach is adaptable to any appropriate dyad-discovery algorithm. Naturally, DiSCo inherits the same problem of signal-to-noise ratio that all motif finders have. In most practical applications where no prior knowledge is available, the user typically performs multiple searches with varying parameter values (motif length, promoter length, etc.) in order to identify motifs that may have biological significance. Also, DiSCo is flexible in the choice of the clustering method and the pair-wise similarity measure it uses. The current clustering method used to extract two broad clusters can be used to recursively extract more clusters, in effect yielding multiple instances of dyad subtypes. Clearly, in this case, larger input datasets may be needed. By default, DiSCo performs the clustering based on the weaker of the two motifs to enable better discriminating power. However, when this motif has very poor IC (close to the background), then using both halves of the dyad for clustering might be more efficient. Similar to most other SDD algorithms, DiSCo reports the average IC of the dyads discovered after running multiple initializations, which can be used to compare the quality of dyads predicted across multiple parameter settings. The DiSCo approach can also be used to analyze a stringently selected set of bound sequences to identify dyads with relatively short inter-motif distance d. But even stringently selected peaks from ChIP-chip or ChIP-seq data tend to be noisy. When we analyzed noisy artificial datasets where 20% of the sequences did not contain any motif, we found that DiSCo performance was fairly stable and superior to SDD (see Supplementary Data for results with additional noise levels). Large-scale datasets (e.g. from ChIP-chip and ChIP-seq experiments) are paving the way for in-depth analysis of in vivo binding preferences of a TF, and an approach such as DiSCo has the potential to help elucidate the preferential mode of TF–DNA interaction under different conditions.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

Funding for open access charge: National Institutes of Health (grants R01LM007994 and R01LM009657). Conflict of interest statement. None declared.

46 in total

1. The Comprehensive Microbial Resource.

Authors: J D Peterson; L A Umayam; T Dickinson; E K Hickey; O White
Journal: Nucleic Acids Res Date: 2001-01-01 Impact factor: 16.971

2. BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes.

Authors: X Liu; D L Brutlag; J S Liu
Journal: Pac Symp Biocomput Date: 2001

Review 3. DNA binding sites: representation and discovery.

Authors: G D Stormo
Journal: Bioinformatics Date: 2000-01 Impact factor: 6.937

4. Identifying target sites for cooperatively binding factors.

Authors: D GuhaThakurta; G D Stormo
Journal: Bioinformatics Date: 2001-07 Impact factor: 6.937

5. Exploring the DNA-binding specificities of zinc fingers with DNA microarrays.

Authors: M L Bulyk; X Huang; Y Choo; G M Church
Journal: Proc Natl Acad Sci U S A Date: 2001-06-12 Impact factor: 11.205

6. Bipartite pattern discovery by entropy minimization-based multiple local alignment.

Authors: Chengpeng Bi; Peter K Rogan
Journal: Nucleic Acids Res Date: 2004-09-23 Impact factor: 16.971

7. Systematic evolution of ligands by exponential enrichment: RNA ligands to bacteriophage T4 DNA polymerase.

Authors: C Tuerk; L Gold
Journal: Science Date: 1990-08-03 Impact factor: 47.728

Review 8. Transcriptional regulation by cAMP and its receptor protein.

Authors: A Kolb; S Busby; H Buc; S Garges; S Adhya
Journal: Annu Rev Biochem Date: 1993 Impact factor: 23.643

9. Consensus DNA site for the Escherichia coli catabolite gene activator protein (CAP): CAP exhibits a 450-fold higher affinity for the consensus DNA site than for the E. coli lac DNA site.

Authors: R H Ebright; Y W Ebright; A Gunasekera
Journal: Nucleic Acids Res Date: 1989-12-25 Impact factor: 16.971

10. A comprehensive library of DNA-binding site matrices for 55 proteins applied to the complete Escherichia coli K-12 genome.

Authors: K Robison; A M McGuire; G M Church
Journal: J Mol Biol Date: 1998-11-27 Impact factor: 5.469

8 in total