
Discovery of regulatory elements is improved by a discriminatory approach.

Eivind Valen, Albin Sandelin, Ole Winther, Anders Krogh.

Abstract

A major goal in post-genome biology is the complete mapping of the gene regulatory networks for every organism. Identification of regulatory elements is a prerequisite for realizing this ambitious goal. A common problem is finding regulatory patterns in promoters of a group of co-expressed genes, but contemporary methods are challenged by the size and diversity of regulatory regions in higher metazoans. Two key issues are the small amount of information contained in a pattern compared to the large promoter regions and the repetitive characteristics of genomic DNA, which both lead to "pattern drowning". We present a new computational method for identifying transcription factor binding sites in promoters using a discriminatory approach with a large negative set encompassing a significant sample of the promoters from the relevant genome. The sequences are described by a probabilistic model and the most discriminatory motifs are identified by maximizing the probability of the sets given the motif model and prior probabilities of motif occurrences in both sets. Due to the large number of promoters in the negative set, an enhanced suffix array is used to improve speed and performance. Using our method, we demonstrate higher accuracy than the best of contemporary methods, high robustness when extending the length of the input sequences and a strong correlation between our objective function and the correct solution. Using a large background set of real promoters instead of a simplified model leads to higher discriminatory power and markedly reduces the need for repeat masking, a common pre-processing step for other pattern finders.

Year: 2009 PMID: 19911049 PMCID: PMC2770120 DOI: 10.1371/journal.pcbi.1000562

Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475

Introduction

The rapid emergence of experimental techniques that can probe for functional elements at whole-genome scales [1] necessitates computational methods to analyze data in these settings. In particular, methods that locate promoters or measure gene expression on genome-wide scales (e.g. [2],[3]) must be complemented by algorithms that can find the active regulatory elements within the larger promoters. Ab initio computational search for transcription factor binding sites (TFBS) in DNA sequences is often termed “motif discovery”. “Motif” here refers to a general pattern describing what DNA sequences the transcription factor binds [4]. Motif discovery is one of the classical problems in computational sequence analysis and can be briefly stated as: Given a set of sequences containing one or several short overrepresented sites, locate these and produce a model describing them. There are two main avenues used to attack this problem: i) enumerative algorithms based on word counting, such as [5],[6], and ii) pattern-based approaches often using position specific weight matrices (WMs), which score sites based on position specific weights [4]. Since the binding preferences of transcription factors (TFs) are not easily captured by a single word or consensus string, pattern-based approaches can give solutions closer to the biological reality and it has been argued that the matrix score is related to the binding energy [7],[8]. However, such approaches correspond to the problem of finding local, optimal multiple alignments, which is NP-complete [9]. Therefore, almost all pattern-based motif finders use statistical optimization methods such as Gibbs sampling or expectation maximization [10],[11]. A typical instance of motif discovery starts with a set of upstream promoter regions of co-expressed genes that are suspected to be co-regulated and, by extension, more likely to be under the control of the same regulatory machinery. 
This set is called the “positive set” and most methods proceed from here by locating motifs that are in some way statistically overrepresented in this set. The most successful applications of motif discovery have been in organisms whose regulatory information is densely aggregated around transcription start sites, such as Saccharomyces cerevisiae (baker's yeast). In mammalian genomes, regulatory information is spread out over wider regions, which makes “pattern drowning” a significant issue; in other words, the information in the regulatory sites is too small to stand out in the large genomic region of interest. In this context, the accuracy of contemporary pattern finders is not sufficient for many biologically important problems [12]. Most methods operate with some notion of a background model describing “generic DNA” against which the over-representation is measured. The model is often a multinomial or a Markov model. The choice of model is important for obtaining good results [13],[14]. However, most such models have difficulty in capturing the complexity of the highly heterogeneous mammalian genome sequence, which has a multitude of different promoter architectures[15], numerous interspersed repeats, low complexity sequences, CpG islands, etc. [16]. Instead of simplifying the underlying DNA sequence by a general model, we take this to its extreme conclusion and use a very large set of promoters as the actual background instead of building a model describing the sequences in the promoters. For simplicity, we use the term “negative set” to describe the background set; this is strictly speaking not true as sites could occur in this set at a much lower frequency, since real promoters are sampled randomly. By contrasting the sets, it is possible to see what common features make the sequences in the positive set unique. Discriminatory motif searching is not a new idea; several methods have been developed that take advantage of a negative set [17]–[24]. 
However, many of these use word-based models [19]–[21], which might not capture the diversity of binding sites. Others again use PWMs, but have binary hit models that do not distinguish between hits as long as they are over a threshold [22]. A discriminatory approach similar to ours has been combined with the use of expression data [18], but depending on the regions that are being investigated this might often not be available or even possible. We adopt an approach similar to DEME [23] to identify the most discriminative set of motifs by modeling the sequence labels (positive or negative) rather than using the conventional generative approach[10],[11]. However, there are some important differences to DEME. Firstly, DEME uses a global string-based search followed by a local gradient refinement, which may miss patterns that are not well-represented by a consensus string, whereas we use a global optimization technique (simulated annealing) for optimizing the model, which does not have this limitation, although it may have others (see below). Secondly, our method (Motif Annealer - MoAn) uses and optimizes a threshold, and uses an enhanced suffix array (ESA) to speed up pattern searches. Thirdly, in MoAn the length of the motif is also optimized. DEME is also particularly targeted towards proteins while our approach is intended for use with DNA. Specifically, we use conditional maximum likelihood to estimate the WMs and their thresholds such that the probability of the positive and negative sets is maximized (see Methods). Thus, the resulting matrices cannot be derived from the frequency matrix for the sites found – it is rather the matrices that lead to the best discrimination. The probability of a sequence is calculated as a product of the probabilities given by the matrices matching above a threshold and a simple null model for non-matching regions. 
From this and prior probabilities for matches in the positive and negative sets, the probability of the set label (positive or negative) is calculated. In this probability the background model cancels. The total likelihood is a product of the class probabilities for all sequences (positive and negative). This conditional likelihood leads to a non-trivial optimization problem which is handled using simulated annealing (see Methods), where we iteratively change the WMs and their thresholds, retaining changes that lead to higher discriminatory power using the Metropolis-Hastings algorithm [25],[26]. Given sufficient iterations, the method guarantees convergence on the optimally discriminatory motifs. To cope with the vast size of the sets we utilize a highly efficient data structure, the ESA, for searching DNA for pattern instances[27]. With reasonable cutoffs, this reduces the computation by an order of magnitude[28].
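
As a rough illustration of why an index over the sequence sets speeds up pattern searches, the sketch below builds a plain suffix array and finds all occurrences of an exact word by binary search. This is a simplification made for illustration: it is not the enhanced suffix array of [27], it does not perform the thresholded WM search used by MoAn, and the function names are hypothetical.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in sorted order.

    Naive construction by sorting suffix strings; fine for a small example,
    far too slow and memory-hungry for thousands of promoters.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, word):
    """Return the sorted start positions of every exact occurrence of word."""
    k = len(word)

    def prefix(rank):
        start = suffix_array[rank]
        return text[start:start + k]

    # Lower bound: first suffix whose k-letter prefix is >= word.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < word:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose k-letter prefix is > word.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= word:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])
```

Once the index is built, each lookup costs a number of string comparisons that grows logarithmically with the total sequence length, which is what makes repeated searches against a large negative set affordable.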

Results

We evaluated our method by comparing its accuracy to a set of widely used motif discovery methods (MEME[29], DEME[23], Weeder[5] and NestedMICA[14]) in several different ways. In all runs, we used the same background set, which consists of 1000 experimentally defined promoters randomly sampled from the mouse genome (Text S1). The evaluation statistics are the same as used in [12] (see Methods) and we also pooled the results from all motifs (grouped by length of the input sequence; see below) and calculated the compound statistics on this. To reduce the influence of the optimization method, we ran all non-deterministic methods five times on each set selecting the best run according to their own scoring function. In line with the recommendations of [12] we used synthetic data sets for the inter-method comparison. These were constructed by taking experimentally defined promoter regions based on strong CAGE tag clusters [2] and planting binding sites from various TFs inside these (Text S1). To decrease possible biases for the methods towards certain specific motif types, we randomly selected one TF from each of the 11 JASPAR[30] families as well as an example of a zinc-finger factor (Table S1). For a given matrix, we randomly chose sites from experimentally validated binding sequences used for constructing the JASPAR matrix instead of generating sites using the matrix. Since the accuracy of motif discovery methods normally deteriorates when sequence length is increased (“pattern drowning”), we evaluated the various methods on sets with sequence lengths varying between 200 and 1200 nucleotides (Table S3). This gave a total of 84 sets (12 motifs ×7 lengths) with 100 sequences in each. Sequences had a site from a given motif planted with a probability of 0.5. For those methods that support it, a background/negative set was provided containing 1000 sequences sampled in the same way and with the same length as the positive sequences. 
We used default settings for all methods except where there were obvious reasons not to (Text S2). Since DEME requires motif length as input we decided to input the correct length of the matrix. This provides DEME with an informational advantage over the other methods. Fig. 1 (and Figs. S4, S5, S6, S7, S8) shows a significant performance gain in using MoAn compared to the other methods as measured by Matthews correlation coefficient on nucleotide level (nCC) and average site performance (ASP) – an average over the positive predictive value and the sensitivity on binding site level (see Methods for details). With both measures, MoAn performs better than any other method on all sequence lengths. In particular, its performance is less affected by increasing input sequence length than that of the other methods; at certain sequence lengths (800, 1200) MoAn's ASP is more than twice that of the second-best method. We also evaluated MoAn with the applicable subset of the evaluation set proposed by [12] (Text S3 and Table S4), where OligoDyad, AnnSpec and MoAn achieve the highest sASP values. We note that this set is challenging as none of the methods perform well overall, and the difference in performance between methods might therefore not be significant. In addition, this set does not evaluate how well a method can deal with increasing lengths of input sequences, which is highly relevant.

Figure 1

Synthetic set evaluation.

The average site performance (lines) and the nucleotide correlation coefficient (bars) of the methods.

Correlation of score and solution

The relationship between our objective function and the correct solution was assessed by plotting the MoAn scores against the sensitivity obtained in all five runs on each of the 84 sets (not just the best run on each set) (Fig. 2). There is a clear correlation (Pearson CC: 0.90) between these two measures. There is a similar correlation with other measures, such as the nCC (Fig. S1).

Figure 2

Correlation of MoAn's objective function (Sc) and site sensitivity (sSn).

All 5 runs on the 84 synthetic sets are used.

The correlation between score and sensitivity is important, because it indicates that the raw score is an indication of quality independent of the motif analyzed. It also shows that choosing the best scoring run of several will often give the best result.

Repetitive sequences

Aside from the problem with decreasing sensitivity as the length of the input sequences increases, repetitive sequences represent a severe problem for motif discovery, as these will often seem to be over-represented, and therefore it is common to mask these repeats. However, masking is always arbitrary, and some repeats are functional [31],[32], so indiscriminate repeat masking is not optimal. When using a large negative set, repeat masking is unnecessary since repeats, if commonly occurring, will feature in the negative set and therefore be avoided as potential hits in the positive. At the same time, we can avoid the reverse problem – if a type of repeat actually is over-represented in the positive set, it can still be found. To demonstrate the insensitivity to repeats on a practical level, we planted repetitive sequences in each of the positive sets with a slightly higher frequency than the real motifs and ran our predictor on these sets both with the normal background and with a background similarly spiked with repeats. Specifically, we planted 1 to 10 consecutive instances of CACTA with a probability of 60% in each sequence. Fig. 3 shows, as expected, that the results do not deviate much from the repeat-less run when repeats are planted in both the positive and negative sequences, while the method picks up the repeats instead when there are no repeats in the negative set. We also performed this test using decoy motifs instead of repeats with similar results (Text S4, Fig. S2).
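
The repeat-spiking procedure can be sketched in a few lines of Python. The text specifies the repeat unit (CACTA), the number of consecutive copies (1 to 10) and the per-sequence probability (60%); everything else here is an assumption made for illustration, in particular the function name, the uniform choice of copy number and position, and the decision to overwrite existing nucleotides so that sequence length is preserved.

```python
import random


def plant_repeats(sequences, unit="CACTA", probability=0.6, max_copies=10, seed=0):
    """Return copies of the sequences with tandem repeats planted in some of them.

    Each sequence receives, with the given probability, a block of
    1..max_copies consecutive copies of unit at a uniformly chosen position.
    The block overwrites the original nucleotides (assumption), so every
    sequence keeps its length.
    """
    rng = random.Random(seed)
    spiked = []
    for seq in sequences:
        if rng.random() < probability:
            block = unit * rng.randint(1, max_copies)
            if len(block) <= len(seq):
                start = rng.randint(0, len(seq) - len(block))
                seq = seq[:start] + block + seq[start + len(block):]
        spiked.append(seq)
    return spiked
```

Applying the same function to both the positive and the negative set corresponds to the spiked-background run, while applying it to the positive set alone corresponds to the run in which the method is expected to pick up the repeats.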

Figure 3

Repeat Assessment.

The average site performance (lines) and the nucleotide correlation coefficient (bars) of MoAn with repeats planted in the two sets.

Real data

Evaluation of methods on real data is difficult and often a poor indication of general performance due to lack of insight into the correct solution [12]; on the other hand, it is necessary to show that the method can be applied to real problems. MoAn and four other methods were run on a collection of real data sets consisting of the binding sites of four human and mouse factors from the PAZAR database[33] and their associated genomic sequence. The sets were split by organism into 7 sets and the regions adjacent on the genome were merged resulting in sets ranging in size from 14 to 118. The merging means that the base sequences can have a varying number of sites and may be of different lengths. The sets were then subsequently enlarged by adding an equal number of randomly selected promoters to increase the difficulty (Text S6 and Table S5) and also padded with their cognate upstream and downstream regions of varying lengths (200–1200, as in the synthetic evaluation) to estimate the impact of noise. Fig. 4 shows the performance over the real sets. MoAn's performance is clearly superior, but not as spectacular as in the more controlled environment with synthetic sequences.

Figure 4

PAZAR set evaluation.

The average site performance (lines) and the nucleotide correlation coefficient (bars) of the methods.

We speculate that the reason for the smaller margin on real data is that the background and foreground of the synthetic sets are essentially sampled from the same pool (RefSeq promoters), while we have made no effort to customize the background for the PAZAR sets. If the genomic environment of the factors differs from normal promoter sequences this could lead to a reduced performance. There are also fewer sets (7 versus 12) in this evaluation, leading to a higher variability. We report additional trials using ChIP-chip data in supplementary material (Text S7, Fig. S3 and Tables S6, S7). MoAn has also been used successfully to discriminate between binding regions of human ESR1 and its paralog ESR2; the results were comparable with matrix-scanning approaches with pre-defined motifs [34].

Co-occurrence of binding sites

An additional aspect of the motif finding problem is that TFs often work by forming complex interactions [35]. Examples include mutually exclusive and cooperative binding. Clusters of TFBSs are commonly termed cis-regulatory modules, and are often responsible for tissue-specific expression. We try to capture these interactions by incorporating co-occurrence of sites from different motifs into our model, with the goal of further increasing predictive power. To test whether our objective function is capable of capturing interactions between factors we constructed a set where co-occurrence of sites from different motifs occurs. We randomly chose 5 pairs of new motifs (Table S2) and planted their corresponding sites in a positive set of 100 promoters with a 40% chance of co-occurrence and 10% of single occurrence. We then spiked the background set with sites from each of the motifs (10% chance each for all sequences) to mimic a situation where it is the interactions of the two sites rather than single sites that are responsible for the regulation. MoAn was then run in co-occurrence mode and compared to two single-occurrence runs in a series. In the serial runs we masked out the predictions from the first iteration before running the second iteration. In Fig. 5 the ASP and nCC are plotted. In our experiment three of the pairs turned out to be composed of motifs with relatively low information, leading to poor performance. However, the two remaining ones show that modeling of co-occurrence can significantly improve performance. This extended model is unfortunately computationally taxing and requires more than twice the number of iterations compared to the single prediction.

Figure 5

Performance of co-occurrence vs. serial runs.

The average site performance (lines) and nucleotide correlation coefficient (bars) of co-occurrence and serial runs on 5 different sets with co-occurring motifs.

Discussion

In this work we have shown the value of using a large negative set instead of a pre-defined background model in motif discovery. Using raw sequences more accurately portrays the background than any general model and therefore higher discriminatory power is achieved. This method is also much less sensitive to “pattern drowning” in larger sequences, which is a bottleneck in computational analysis of mammalian regulatory regions. However, while our method takes a significant step towards routine motif discovery on large sequences, the problem cannot be considered fully solved. In particular, MoAn accuracy may be further improved by incorporating information on evolutionary constraints (phylogenetic footprinting) [36] or DNA accessibility [24],[37]. In our opinion DEME is the best runner-up of the methods. It often predicts the correct motif and has a high sensitivity, but often at the cost of a large number of false positives as it also predicts sites in sequences that do not contain one. MoAn seems to be better at balancing the sensitivity and specificity. On the other hand DEME is also given an artificial advantage by having the correct motif length as input and it is uncertain how advantageous this is. Weeder performed surprisingly poorly given its stellar performance in a recent evaluation [12]. This might be due to motif selection, which we did according to the most redundant motif, but which in [12] was done in a more complicated manner that is not part of the current Weeder package. That procedure led to no predictions on several of the harder sets, which might give Weeder a statistical advantage (as discussed in [12]). A concern that might be raised is that optimizing a cutoff might lead to a conservative estimate of binding sites at the expense of weaker sites. However, assessing this is hard since experiments have their own thresholds in the post-analysis and any evaluation of MoAn's threshold will be dependent upon those. Investigations where we artificially forced the cutoff to remain low led to a reduction in performance (data not shown). We address this potential problem indirectly by providing a matrix that can be used to search sequences at a lower threshold. Future improvements of MoAn will focus on the optimization algorithm, which currently is not robust enough to always produce reliable results. In our current implementation we avoid this problem by running the algorithm many times to see that the solution is stable.

Methods

Evaluation is done on both site and nucleotide levels. The statistics used are similar to those in the recent large scale evaluation [12]. To get a compound statistic for all motifs at each length we used what is there described as the “combined” method for summarizing. This consists of treating all sets of a given length as one big set, summing up all the basic statistics below (nTP, nTN … sFN) before calculating the compound statistics. This removes the problem of undefined statistics in those cases where a method does not predict any sites.

Basic statistics

nTP: Number of nts part of a site correctly predicted.
nTN: Number of background nts correctly predicted.
nFP: Number of background nts predicted to be part of a motif.
nFN: Number of nts part of a site predicted as background.
sTP: Number of real sites that share over 50% of their nts with a predicted site.
sFP: Number of predicted sites that share less than 50% of their nts with a real site.
sFN: Number of real sites that share less than 50% of their nts with a predicted site.

Note that we are more conservative with respect to the site prediction than [12] in that we demand at least half of the nucleotides overlapped to get a single sTP.
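
A minimal Python sketch of how these counts could be obtained for one sequence is given below. Sites are represented as half-open (start, end) intervals, which is an assumption of this sketch, as are the function names. The text leaves the case of exactly 50% overlap open ("over 50%" versus "at least half"); the sketch counts exactly half as a match.

```python
def overlap(a, b):
    """Number of nucleotides shared by two half-open intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def matched(site, others):
    """True if site shares at least half of its nucleotides with any other site."""
    length = site[1] - site[0]
    return any(2 * overlap(site, other) >= length for other in others)


def site_statistics(real_sites, predicted_sites):
    """Return (sTP, sFP, sFN) for one sequence."""
    s_tp = sum(matched(real, predicted_sites) for real in real_sites)
    s_fn = len(real_sites) - s_tp
    s_fp = sum(not matched(pred, real_sites) for pred in predicted_sites)
    return s_tp, s_fp, s_fn


def nucleotide_statistics(seq_len, real_sites, predicted_sites):
    """Return (nTP, nTN, nFP, nFN) for one sequence of length seq_len."""
    real = {p for start, end in real_sites for p in range(start, end)}
    pred = {p for start, end in predicted_sites for p in range(start, end)}
    n_tp = len(real & pred)
    n_fp = len(pred - real)
    n_fn = len(real - pred)
    n_tn = seq_len - n_tp - n_fp - n_fn
    return n_tp, n_tn, n_fp, n_fn
```

For example, a real site at (10, 20) and predictions at (14, 24) and (40, 50) in a 60-nt sequence give one sTP (6 of 10 nucleotides shared), one sFP and no sFN.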

Compound statistics

Derived from the basic statistics:
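
The compound statistics named in the text can be sketched as follows, using the "combined" pooling described above. The formulas are the standard definitions and are an assumption of this sketch: sensitivity as TP/(TP+FN), positive predictive value as TP/(TP+FP), ASP as the mean of site-level sensitivity and positive predictive value, and nCC as the Matthews correlation coefficient on nucleotide counts. Function and key names are illustrative.

```python
import math


def pool(stats_per_set):
    """Sum the basic statistics over all sets of a given length ("combined")."""
    keys = ("nTP", "nTN", "nFP", "nFN", "sTP", "sFP", "sFN")
    return {k: sum(stats[k] for stats in stats_per_set) for k in keys}


def compound_statistics(s):
    """Compute site- and nucleotide-level compound statistics from pooled counts.

    No guard against zero denominators: pooling over many sets is what is
    expected to keep them non-zero.
    """
    s_sn = s["sTP"] / (s["sTP"] + s["sFN"])    # site sensitivity
    s_ppv = s["sTP"] / (s["sTP"] + s["sFP"])   # site positive predictive value
    asp = (s_sn + s_ppv) / 2                   # average site performance
    denominator = math.sqrt(
        (s["nTP"] + s["nFN"]) * (s["nTN"] + s["nFP"])
        * (s["nTP"] + s["nFP"]) * (s["nTN"] + s["nFN"])
    )
    n_cc = (s["nTP"] * s["nTN"] - s["nFN"] * s["nFP"]) / denominator
    return {"sSn": s_sn, "sPPV": s_ppv, "ASP": asp, "nCC": n_cc}
```

A set on which a method makes no predictions contributes only false negatives and true negatives to the pooled counts, so it lowers the pooled sensitivity without producing an undefined per-set value.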

Objective function

A sequence is assumed to be described by a mixture model consisting of a background distribution and a set of WMs describing the binding affinities of the TFs. The WMs contain log-odds scores of the type:

$$W_{i,a} = \log_2 \frac{p_{i,a}}{q_a} \qquad (1)$$

where $i$ is the position in the WM, $a$ is a letter in the DNA alphabet, $p_{i,a}$ is the probability of having letter $a$ at position $i$ in the motif described by $W$, and $q_a$ is the background probability of letter $a$. The score of a matrix $W$ of length $w$ aligned at a position $j$ in a sequence is therefore:

$$S_j(W) = \sum_{i=1}^{w} W_{i,\,x_{j+i-1}} \qquad (2)$$

where $x_j$ is the DNA letter at position $j$ in sequence $x$. The aim is to discriminate between two sets of sequences $X^c$, where label $c = +$ denotes the positive set and $c = -$ the negative. The prior probability of binding site occurrence in a sequence contained in set $c$ is called $a^c$. We assume that there is a marked difference in the site occurrence between the two sets and want to construct a score that captures how well a set of WMs describe this difference. Using two WMs as an example, $W_1$ and $W_2$, there are four possible ways for a sequence to be generated. With prior probability $a_0^c$ it contains no sites and is only generated by the background model $B(x)$. Or, with prior probability $a_k^c$, it contains a single site (one of the two) corresponding to one WM positioned at nucleotide number $j$ ($k$ is equal to 1 or 2 corresponding to the two different matrices). This is written $B(x)\,2^{S_j(W_k)}$, where $S_j(W_k)$ is the score of the matrix aligned to the nucleotides at position $j$ (eq. 2) and 2 is the base of the log scores contained in the WM. Note that the log scores in a WM are divided by the background model, so the background ($B$) cancels out in sites where the motif occurs. The final case, with prior probability $a_{12}^c$, is the co-occurrence of two sites in a sequence, which is $B(x)\,2^{S_j(W_1)}\,2^{S_{j'}(W_2)}$. However, this is only correct when the sites are not overlapping since otherwise the overlapping nucleotides would be included in the product twice. Therefore we disallow overlaps.

For efficiency reasons, we do not calculate the score in its entirety. We assume that it is the strong sites that contribute the most to the equation and introduce a cutoff for each WM on the minimum score of a site. This enables an efficient search in the ESA. This is not without biological merit since WM scores and binding energies for known TFs are correlated, and at some point the binding energies of a TF and a poor binding sequence must be too small to matter [4]. It is also a standard method to use when scanning with known matrices [38]. So we only consider sites that score above a threshold, which is called $t_k$ for matrix $W_k$. Then the probability of a sequence $x$ from the set $c$ being generated by the WMs is

$$P(x \mid c, W, t) = B(x)\left[\tilde{a}_0^c + a_1^c E_1(x) + a_2^c E_2(x) + a_{12}^c E_{12}(x)\right]$$

where $E_k(x)$ is the expectation over $j$ of $2^{S_j(W_k)}$ over all predicted sites:

$$E_k(x) = \frac{1}{N_x} \sum_{j} 2^{S_j(W_k)}\,\theta\!\left(S_j(W_k) - t_k\right)$$

with $\theta$ being the step function (1 above 0 and zero otherwise) and $N_x$ the number of positions at which the matrix can be aligned to $x$. The co-occurrence expectation $E_{12}(x)$ is defined in a similar way with overlaps disallowed. The effective weight of no sites, $\tilde{a}_0^c$, accounts for extra weight given to no sites due to alignments not meeting the threshold. With this definition, $P(x \mid c, W, t)$ is the probability or generative model of the sequence conditioned on the WMs and thresholds, $W$ and $t$.

To find the WMs that best explain the difference in occurrence between the sets we use a discriminative objective function based on the probability of the labels given the sequences and WMs, formally $P(c \mid x, W, t)$. This is the logistic likelihood function for binary classification, see e.g. [39]. The discriminative model can thus be viewed as logistic regression with an adaptive set of basis functions. For multiple sequences assumed to be independent, the joint probability is the product of the single sequence probabilities over all sequences in both the positive and negative set:

$$Sc = \log \prod_{n=1}^{N} P(c_n \mid x_n, W, t) = \sum_{n=1}^{N} \log P(c_n \mid x_n, W, t)$$

We refer to this function as the (log likelihood) score, $Sc$. Based on the sequence density we can use Bayes theorem to calculate the probability of the label given the WMs $W$, the thresholds $t$, and the sequence $x$:

$$P(c \mid x, W, t) = \frac{P(x \mid c, W, t)\,P(c)}{\sum_{c'} P(x \mid c', W, t)\,P(c')}$$

We observe that the prior probability of $c$ is proportional to the number of sequences in the set, $N_c$, divided by the total number of sequences $N$. A very high threshold will give no matches, and the probability will then be a constant given by the priors and the size of the two sets. Matches that score above the threshold in the negative set will lower the score and matches above the threshold in the positive set will increase the score, so the game is to obtain as many high-scoring matches in the positive set as possible without introducing too many matches in the negative set.

The prior is conservative in our runs in that we are strict about promoting hits in the positive set, but only moderately strict about disallowing negative hits. For a single matrix the priors are $a_0^{+}$ = 0.01 and $a_1^{+}$ = 0.99 in the positive set, and $a_0^{-}$ = 0.80 and $a_1^{-}$ = 0.20 in the negative set. For two matrices the positive-set priors are 0 (no site), 0.1 (single site) and 0.9 (co-occurrence), and the negative-set priors are 0.80 (no site), 0.15 (single site) and 0.05 (co-occurrence). These priors can be set by the user if prior knowledge is available about the set (e.g. a high confidence negative set or an uncertain positive set). In the evaluation we deliberately chose a probability of having a site (0.5) in a sequence very different from the model prior (0.99) to avoid giving our own method a big advantage. It shows that the method is not very sensitive to the choice of prior.
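
The following Python sketch illustrates the discriminative score for a single WM. It is a simplified reading of the description above, not MoAn's implementation: all function names are hypothetical, the sequence is scanned directly instead of through an ESA, the site expectation is taken as the average of 2^score over all alignment positions with sub-threshold positions contributing zero, and the no-site weight is the plain prior (the effective-weight correction for sub-threshold alignments is omitted). The background term is left out because it cancels in the label probability.

```python
import math

ALPHABET = "ACGT"

# Single-matrix priors from the text: (no site, one site) per set.
DEFAULT_PRIORS = {"pos": (0.01, 0.99), "neg": (0.80, 0.20)}


def log_odds_matrix(column_probs, background):
    """Turn per-column letter probabilities into base-2 log-odds scores."""
    return [{a: math.log2(col[a] / background[a]) for a in ALPHABET}
            for col in column_probs]


def site_scores(wm, seq):
    """Score of the WM aligned at every possible start position of seq."""
    width = len(wm)
    return [sum(wm[i][seq[j + i]] for i in range(width))
            for j in range(len(seq) - width + 1)]


def site_expectation(wm, seq, threshold):
    """Average of 2**score over positions, counting only scores above threshold."""
    scores = site_scores(wm, seq)
    if not scores:
        return 0.0
    return sum(2.0 ** s for s in scores if s > threshold) / len(scores)


def positive_label_probability(wm, seq, threshold, priors, class_prior):
    """P(label = positive | sequence) by Bayes theorem; background cancels."""
    e = site_expectation(wm, seq, threshold)
    weight = {c: class_prior[c] * (priors[c][0] + priors[c][1] * e)
              for c in ("pos", "neg")}
    return weight["pos"] / (weight["pos"] + weight["neg"])


def log_likelihood_score(wm, threshold, positives, negatives, priors=DEFAULT_PRIORS):
    """Sum of log label probabilities over both sets (higher is better)."""
    total = len(positives) + len(negatives)
    class_prior = {"pos": len(positives) / total, "neg": len(negatives) / total}
    score = 0.0
    for seq in positives:
        score += math.log(positive_label_probability(wm, seq, threshold, priors, class_prior))
    for seq in negatives:
        score += math.log(1.0 - positive_label_probability(wm, seq, threshold, priors, class_prior))
    return score
```

With a threshold so high that nothing matches, every sequence receives the same label probability, fixed by the priors and the set sizes; lowering the threshold until true sites in the positive set match raises the score, while matches in the negative set pull it back down.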

Optimization

The objective function outlined above is optimized using simulated annealing [40]. Informally, it proceeds by iteratively proposing a candidate solution and then accepting or rejecting it depending on how good it is compared to the current solution. It sometimes accepts changes for the worse and therefore possesses the power to escape local maxima. The hope is that it will converge on a solution that is close to optimal. Formally, this translates to a walk over the search space where, in the current state $s$, the next state is either $s$ itself or the candidate solution $s'$, depending on their relative scores and a temperature parameter $T$. The temperature parameter is lowered for each iteration using as default an exponential cooling scheme (for details see Text S5), thus incrementally constraining the neighborhood of accepted changes. Candidate solutions are proposed by applying one of the steps listed below. In the case of multiple matrices, only one is changed at a time. We perform all steps on an integer “count” matrix which is then translated into a log-odds WM prior to searching the ESA, but note that the “count” matrix does not represent actual letter frequencies in the selected sites. The steps are:

1. Alter the contents of the WM columns by moving counts from one random cell to another within a column. The number of counts moved is selected uniformly from 1 to the current count number for the cell.
2. Extend the WM in either direction. A uniformly sampled number of columns (1 to 5) is added and the counts of these are decided by consulting the sequence locations of hits scoring above the cutoff. The counts are proportional to the counts in the columns from the extended hits, but normalized so that all columns have the same counts.
3. Decrease the length of the WM by deleting columns. Similarly to adding columns, a uniformly selected number between 1 and 5 columns are deleted.
4. Slide the WM across the sequences. Columns are deleted on one side and extended on the other according to the two steps above.
5. Alter the cutoff of the matrix. The cutoff is expressed in bits per column and a new candidate is proposed by sampling uniformly from 0.6 to 2 bits.

Note that for the extend and decrease steps there is a minimum and maximum number of columns for a motif. The defaults for these are 5 and 15 respectively. The matrix is initialized with random counts and the cutoff is also selected uniformly according to the last step in the list above. Termination of the optimization is based only on the number of iterations, which is by default set to a rather conservative value of 30 million iterations. Time requirements for a single run vary with the set size, but were for our runs comparable to NestedMICA (single threaded) and considerably faster than Weeder's “large” run and DEME.
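
The annealing loop and the first proposal step can be sketched in Python as follows. This is a generic illustration under stated assumptions, not MoAn's code: the function names, the starting temperature, the cooling factor and the Metropolis acceptance rule exp(delta / T) are choices made for the sketch (the actual schedule is described in Text S5), and keeping track of the best state seen is a convenience added here.

```python
import math
import random


def move_counts_within_column(counts, rng):
    """Proposal: move 1..n counts from one random cell to another in one column.

    counts is a list of columns, each a list of four integer counts (A, C, G, T).
    Column totals are preserved, so every column always has a donor cell.
    """
    proposal = [column[:] for column in counts]
    column = rng.choice(proposal)
    source = rng.choice([k for k, n in enumerate(column) if n > 0])
    target = rng.choice([k for k in range(len(column)) if k != source])
    moved = rng.randint(1, column[source])
    column[source] -= moved
    column[target] += moved
    return proposal


def anneal(initial, score, propose, iterations, t_start=1.0, cooling=0.999, seed=0):
    """Maximize score by simulated annealing with exponential cooling."""
    rng = random.Random(seed)
    current, current_score = initial, score(initial)
    best, best_score = current, current_score
    temperature = t_start
    for _ in range(iterations):
        candidate = propose(current, rng)
        candidate_score = score(candidate)
        delta = candidate_score - current_score
        # Always accept improvements; accept a worse candidate with
        # probability exp(delta / T), which shrinks as T is lowered.
        if delta >= 0 or rng.random() < math.exp(delta / max(temperature, 1e-12)):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = current, current_score
        temperature *= cooling
    return best, best_score
```

In a full implementation the score callback would convert the count matrix to a log-odds WM and evaluate the discriminative objective on both sequence sets, and the propose callback would pick among all five step types rather than only the first.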

Availability

Source code as well as data sets is freely available at the author's web site: http://moan.binf.ku.dk

Supplementary files:

Correlation of MoAn's objective function (Sc) and nucleotide correlation coefficient (nCC) (0.01 MB EPS)
Evaluation with decoy motifs. Average site performance (lines) and the nucleotide correlation coefficient (bars) of MoAn with decoy motifs planted in the two sets. (0.01 MB EPS)
Discriminatory power of matrices. ROC curve showing discriminatory power of matrices produced by MoAn and NestedMICA on the ESR1 data set. The line extends from the highest cutoff possible for that matrix (bottom right) to a cutoff of 0 (top left). (0.03 MB EPS)
Performance on individual sets for MoAn. The average site performance (lines) and the nucleotide correlation coefficient (bars) on the sets. (0.02 MB EPS)
Performance on individual sets for DEME. The average site performance (lines) and the nucleotide correlation coefficient (bars) on the sets. (0.02 MB EPS)
Performance on individual sets for MEME. The average site performance (lines) and the nucleotide correlation coefficient (bars) on the sets. (0.02 MB EPS)
Performance on individual sets for Weeder. The average site performance (lines) and the nucleotide correlation coefficient (bars) on the sets. (0.02 MB EPS)
Performance on individual sets for NestedMICA. The average site performance (lines) and the nucleotide correlation coefficient (bars) on the sets. (0.02 MB EPS)
Data set construction (0.03 MB PDF)
Running parameters (0.03 MB PDF)
Tompa assessment (0.03 MB PDF)
Sequences spiked with decoy motifs (0.02 MB PDF)
Annealing schedule (0.03 MB PDF)
PAZAR data sets (0.03 MB PDF)
ChIP-chip data sets (0.04 MB PDF)
Length of upstream and downstream extensions (0.01 MB PDF)
Motifs planted in single occurrence sets (0.04 MB PDF)
Motifs planted in co-occurrence sets (0.03 MB PDF)
Results on the mammalian subset of the Tompa assessment (0.01 MB PDF)
Sizes of PAZAR data sets (0.01 MB PDF)
Sizes of ENCODE data sets (0.01 MB PDF)
Performance on ENCODE data sets (0.07 MB PDF)

